Data management system and method of controlling

ABSTRACT

A storage system. In some embodiments, the storage system includes a plurality of object stores, and a plurality of data managers, connected to the object stores. The plurality of data managers may include a plurality of processing circuits. A first processing circuit of the plurality of processing circuits may be configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits may be configured to process primarily input-output completions.

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application claims priority to and the benefit of (i) U.S. Provisional Application No. 62/969,623, filed Feb. 3, 2020, entitled “DATA STORAGE PLATFORM”, (ii) U.S. Provisional Application No. 62/969,650, filed Feb. 3, 2020, entitled “DATA MANAGEMENT SYSTEM”, and (iii) U.S. Provisional Application No. 62/970,158, filed Feb. 4, 2020, entitled “DATA MANAGEMENT SYSTEM”; the entire contents of all of the documents identified in this paragraph are incorporated herein by reference.

SECTION I Deterministic Storage IO Latencies for Scalable Performance in a Distributed Object Store

The text in the present “Section I” of the Specification, including any reference numerals or characters and any references to figures, refers and corresponds to FIGS. 1-3 with the label “Section I”, and does not refer or correspond to the text in sections II-IV, nor any of the reference numerals, characters, or figures with the labels on the figure sheets that have the label “Section II”, “Section III”, or “Section IV”. That is, each of the Sections I-IV in the present Specification should be interpreted in the context of the corresponding description in the same section and the figures labeled with the same section, respectively. Notwithstanding the foregoing, however, various aspects and inventive concepts of the various sections may be applied to aspects and inventive concepts of other sections.

FIELD

One or more aspects of embodiments according to the present disclosure relate to storage systems, and more particularly to a system and method for improving input-output latency, and for improving the consistency of input-output latency, in a storage system.

BACKGROUND

A storage system based on a distributed object store may have various advantages, including scalability to large size and high storage capacity. Such a system, however, may suffer from high input-output latency, or from variable input-output latency, in part because of contention for computing resources by multiple threads processing input-output operations and processing input-output completions.

Thus, there is a need for a system and method for improving input-output latency, and for improving the consistency of input-output latency, in a storage system.

SUMMARY

According to an embodiment of the present disclosure, there is provided a storage system, including: a plurality of object stores; and a plurality of data managers, connected to the object stores, the plurality of data managers including a plurality of processing circuits, a first processing circuit of the plurality of processing circuits being configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits being configured to process primarily input-output completions.

In some embodiments, the second processing circuit is configured to execute a single software thread.

In some embodiments, the processing of the input-output operations includes writing data for data resiliency.

In some embodiments, the writing of data for data resiliency includes: writing first data to a first object store of the plurality of object stores; and writing the first data to a second object store of the plurality of object stores.

In some embodiments, the writing of data for data resiliency includes: writing first data to a first object store of the plurality of object stores; and writing parity data corresponding to the first data to a second object store of the plurality of object stores.

In some embodiments, the first processing circuit is configured to process at least 10 times as many input-output operations as input-output completions.

In some embodiments, the second processing circuit is configured to process at least 10 times as many input-output completions as input-output operations.

In some embodiments, the first processing circuit is configured to process only input-output operations.

In some embodiments, the second processing circuit is configured to process only input-output completions.

In some embodiments, the first processing circuit is a first core of a first data manager, and the second processing circuit is a second core of the first data manager.

In some embodiments: a first data manager of the plurality of data managers includes a plurality of cores; a first subset of the plurality of cores is configured to process primarily input-output operations and a second subset of the plurality of cores is configured to process primarily input-output completions; and the first subset includes at least 10 times as many cores as the second subset.

In some embodiments, the first subset includes at most 100 times as many cores as the second subset.

According to an embodiment of the present disclosure, there is provided a method for operating a storage system including a plurality of object stores and a plurality of data managers, the data managers being connected to the object stores, the method including: receiving a plurality of contiguous requests to perform input-output operations; processing, by a plurality of processing circuits of the data managers, the input-output operations; and processing, by the plurality of processing circuits of the data managers, a plurality of input-output completions corresponding to the input-output operations, wherein: a first processing circuit of the plurality of processing circuits processes primarily input-output operations, and a second processing circuit of the plurality of processing circuits processes primarily input-output completions.

In some embodiments, the second processing circuit is configured to execute a single software thread.

In some embodiments, the processing of the input-output operations includes writing data for data resiliency.

In some embodiments, the writing of data for data resiliency includes: writing first data to a first object store of the plurality of object stores; and writing the first data to a second object store of the plurality of object stores.

In some embodiments, the writing of data for data resiliency includes: writing first data to a first object store of the plurality of object stores; and writing parity data corresponding to the first data to a second object store of the plurality of object stores.

According to an embodiment of the present disclosure, there is provided a storage system, including: means for storing objects; and a plurality of data managers, connected to the means for storing objects, the plurality of data managers including a plurality of processing circuits, a first processing circuit of the plurality of processing circuits being configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits being configured to process primarily input-output completions.

In some embodiments, the second processing circuit is configured to execute a single software thread.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and appended drawings wherein:

FIG. 1 is a block diagram of a storage system, according to an embodiment of the present disclosure;

FIG. 2 is an illustration of thread contention; and

FIG. 3 is a block diagram of a storage system, according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The detailed description set forth below in connection with the appended drawings is intended as a description of some example embodiments of a system and method for improving input-output latency, and for improving the consistency of input-output latency, in a storage system, provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.

FIG. 1 shows a storage system, in some embodiments. Each of a plurality of object stores, or “key value stores” 105 provides persistent storage, through an object store interface. Each of a plurality of data managers 110 receives data storage and retrieval requests from clients such as applications and workloads, and, for each such request, sends one or more corresponding input-output commands to one or more of the object stores 105. The object stores 105 may then process input-output operations in accordance with the input-output commands. Each object store 105 may send, upon completion of each of the input-output operations, one or more corresponding input-output completions to the data managers 110, to confirm that the processing of the input-output operation has been completed. For each input-output command, the object store 105 may send a single input-output completion, or it may send a plurality of input-output completions (e.g., if multiple data operations are involved, to ensure data resiliency, as discussed in further detail below). The object stores 105 may communicate with the data managers 110 through an object store interface. The data managers 110 may communicate with the clients through any suitable interface, e.g., through an object store interface, through a Server Message Block (SMB) interface, or through a Network File System (NFS) (e.g., NFSv3 or NFSv4) interface. The storage system of FIG. 1 may be referred to as a “distributed object store”.

A scalable storage solution, such as that illustrated in FIG. 1, may employ a large number of parallel threads, running in the data managers, to achieve performance. To achieve good performance with parallel threading, it may be advantageous for these threads to run as concurrently as possible with little or no contention between them. Also important for achieving good performance is consistent storage input-output (IO) latency. The nature of a distributed object store may, however, present challenges to the requirements of concurrency and consistent storage IO latencies.

The storage IO flow at the front end of the object store (e.g., involving the data managers 110) is organized in terms of clients (e.g., applications or workflows) that are connecting to the object store. However, the back end of the storage IO flow is organized in terms of the internal components (e.g., the object stores 105) of the distributed object store. The object stores 105 of the back end may have a significant impact on the performance of the storage system. The object stores 105 may be used by the storage system to manage persistency of the data and also to provide the data replication or saving of other redundant data (e.g., parity data) to ensure data resiliency.

As used herein, “data resiliency” means the ability to recover stored data even if one (or, in some cases, more than one) of the object stores 105 fails or is taken off line. To accomplish this, data replication may be used; e.g., when a quantity of data is written to a first one of the object stores 105, the same data may be written to one or more other object stores 105. In some embodiments, writing data for data resiliency may include, when a quantity of data is written to a first one of the object stores 105, writing parity data to one or more of the other object stores 105. As used herein, “parity data” are any redundant data used as part of an erasure code, i.e., as part of an encoding system that allows the recovery of erased data, or of data that have otherwise become unavailable (e.g., as a result of the object stores 105 failing or being taken off line).
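The two approaches to data resiliency described above may be illustrated with a minimal sketch. It assumes a single byte-wise XOR parity block as the erasure code, which is only one possible choice and is not stated in the present disclosure; the in-memory dictionaries standing in for object stores, the object names, and the helper functions `xor_parity` and `recover_missing` are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical in-memory stand-ins for three object stores.
stores = [{}, {}, {}]

# Replication: the same data is written to two object stores.
stores[0]["obj-replicated"] = b"hello"
stores[1]["obj-replicated"] = b"hello"

# Parity: two data blocks go to two stores, their parity to a third.
block_a, block_b = b"ABCD", b"WXYZ"
stores[0]["obj-a"] = block_a
stores[1]["obj-b"] = block_b
stores[2]["obj-parity"] = xor_parity([block_a, block_b])

# If store 0 becomes unavailable, block_a is recoverable from the other two.
recovered = recover_missing([stores[1]["obj-b"]], stores[2]["obj-parity"])
```

In both cases a single client write results in writes to more than one object store, which is why one input-output operation may produce several input-output completions.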

The dichotomy in the nature of front end and back end storage IO flows in the distributed object store presents a challenge to the requirements of concurrency and consistency for a high performance object store. It may cause multiple front-end storage IO streams to compete with unrelated storage IO streams for the resources, within the data managers 110, needed to issue and process completions of storage IO requests. This challenge may be further aggravated by the need for a distributed object store to perform data replication for resiliency, as mentioned above.

These conditions affect concurrency between the threads and the ability to achieve consistent storage IO latencies for a high performance object store. This challenge becomes more acute as the data managers 110 employ more threads for performance, as the use of a larger number of threads may lead to more contention between (i) the parallel threads performing the processing of input-output operations and (ii) the parallel threads performing processing of input-output completions.

FIG. 2 depicts the threat to concurrency and consistency due to the above-explained conditions that may exist in a distributed object store. In FIG. 2, the data managers 110 are shown, for ease of illustration, as a single line. The object stores 105 may belong to different nodes in a distributed system. For example, the object stores 105 may be implemented in a plurality of nodes each of which may include one or more CPUs (central processing units) and a plurality of persistent storage devices (e.g., solid state drives or hard drives). Each persistent storage device may provide the storage for several object stores 105; for example, a persistent storage device may provide the storage for (i) an object store 105 for storing immutable objects, (ii) an object store 105 for storing objects for which overwrite operations are permitted, and (iii) an object store 105 for storing objects for which append operations are permitted. The threads running in the data managers 110 (e.g., both types of threads, (i) the threads performing the processing of input-output operations and (ii) the threads performing processing of input-output completions) may contend (i) for resources in the data managers 110 and (ii) for the locks on the object stores 105 that may be needed for some operations.

These challenges to the concurrency in the storage IO path may be particularly present in a distributed object store for several reasons. First, the distributed nature of the storage IO flow may mean that in a distributed system, the complete flow for an input-output operation may span multiple data managers 110. Second, the implementation of data resiliency in a distributed system may mean that the distributed system is configured to be able to serve data even in the face of individual component failures. Third, distributed systems may achieve data resiliency by replicating data across several components (e.g., across several object stores 105). The data resiliency requirements may mean that the relationship between input-output completions and input-output operations may be a many to one relationship, e.g., a plurality of input-output completions may correspond to a single input-output operation. As such, the challenge for consistent performance in a distributed object store may be to leverage the concurrent nature of the front-end storage IO path and at the same time provide consistent storage IO latencies at scale.
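The many to one relationship between input-output completions and input-output operations may be sketched as a simple counter per in-flight operation. This is only an illustration under assumed names: the class `CompletionTracker`, its methods, and the operation identifier `"write-42"` are hypothetical and do not appear in the present disclosure.

```python
class CompletionTracker:
    """Counts outstanding completions for each in-flight operation."""

    def __init__(self):
        self._outstanding = {}

    def register(self, operation_id, expected_completions):
        """Note how many completions an operation must receive to finish."""
        if expected_completions < 1:
            raise ValueError("an operation needs at least one completion")
        self._outstanding[operation_id] = expected_completions

    def on_completion(self, operation_id):
        """Record one completion; return True once the operation is finished."""
        remaining = self._outstanding[operation_id] - 1
        if remaining == 0:
            del self._outstanding[operation_id]
            return True
        self._outstanding[operation_id] = remaining
        return False


# A write replicated to three object stores yields three completions.
tracker = CompletionTracker()
tracker.register("write-42", expected_completions=3)
results = [tracker.on_completion("write-42") for _ in range(3)]
print(results)  # [False, False, True]
```

Only the last of the three completions marks the operation as finished, so the volume of completion processing may grow faster than the number of client requests when replication is used.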

As such, the storage IO latencies in a large distributed system may become unpredictable with scale as storage IO completion processing competes for common resources with the processing involved with the influx of new storage IO operations into the distributed object store system. This competition between (i) the processing involved in the influx of new IO operations into the system and (ii) the processing involved in processing IO completions of existing IO operations already in the system may create unpredictability in the latencies seen by data storage and retrieval requests from clients such as applications and workloads.

In some embodiments, these challenges are addressed using a dedicated subsystem that processes IO completions, as illustrated in FIG. 3. This structure insulates the IO completion processing from the scale of the input-output operations, and reduces contention between the processing of input-output operations and the processing of input-output completions.

For example, one or more hardware threads may be dedicated for processing IO completions, thereby providing dedicated compute resources so the IO completion processing may scale with the number of threads performing IO operations. As used herein, a “hardware thread” is a thread that is confined to a single processing element (e.g., a single core of a multi-core processor), and that shares the processing element with a limited number of other (software) threads. As used herein, a “processing element” or “core” is a processing circuit configured to fetch instructions and execute them.

The number of processing elements dedicated to processing input-output completions may scale with the number of processing elements used for processing of input-output operations. For example, if each of the data managers 110 has 72 cores, then 1, 2 or 4 of the cores in each of the data managers 110 may be dedicated to processing input-output completions. In some embodiments the ratio of (i) the number of cores used for processing input-output operations to (ii) the number of cores used for processing input-output completions is between 10 and 100. The proportion of cores allocated to processing input-output completions may be selected in accordance with the relative processing burdens of processing input-output completions and of processing input-output operations; if too few cores are allocated for processing input-output completions, the input-output completions latency may be excessive and excessively variable, and if too many cores are allocated for processing input-output completions, the aggregate performance of the (fewer) cores processing input-output operations may be needlessly degraded.
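The arithmetic behind the 72-core example may be checked with a short sketch. It assumes, for illustration only, that every core not dedicated to input-output completions is used for input-output operations; the function name `core_ratio` is hypothetical.

```python
def core_ratio(total_cores, completion_cores):
    """Ratio of cores processing operations to cores processing completions.

    Assumes every core not dedicated to completions processes operations.
    """
    if not 0 < completion_cores < total_cores:
        raise ValueError("completion_cores must be between 1 and total_cores - 1")
    operation_cores = total_cores - completion_cores
    return operation_cores / completion_cores


for completion_cores in (1, 2, 4):
    ratio = core_ratio(72, completion_cores)
    in_range = 10 <= ratio <= 100
    print(completion_cores, ratio, in_range)
# 1 71.0 True
# 2 35.0 True
# 4 17.0 True
```

Under this assumption, each of the three allocations mentioned above gives a ratio between 10 and 100.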

In the embodiment of FIG. 3, each of the object stores 105 may have a first queue 305 for input-output commands received from the data managers 110 and a second queue 310 for input-output completions generated by the object store 105. A plurality of first cores 315 is configured to process input-output operations, and a second core 320 is dedicated to processing input-output completions.

The dedication of hardware to the processing of input-output completions may significantly improve the extent to which IO completion processing times remain deterministic and independent of the number of threads processing input-output operations. To further reduce contention between the processing of input-output operations and the processing of input-output completions, a single software thread (or a fixed, limited number of software threads) pinned to the dedicated hardware may perform the processing of input-output completions. This may avoid context switches, further improving the extent to which IO completion processing times are deterministic; it may also eliminate the overhead of scheduling and any unpredictability that otherwise might result from such scheduling, thus helping to achieve consistent latencies.
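The separation of roles described above (a command queue, a completion queue, several workers issuing operations, and a single thread handling all completions) may be sketched as follows. This is a toy model, not an implementation of the disclosed system: all names are hypothetical, the object store is simulated by a thread, Python threads do not execute in parallel on separate cores in the way dedicated hardware would, and the pinning helper is a best-effort assumption that relies on `os.sched_setaffinity`, which is available only on some platforms (e.g., Linux).

```python
import os
import queue
import threading

command_queue = queue.Queue()     # stands in for the first queue 305
completion_queue = queue.Queue()  # stands in for the second queue 310
STOP = object()                   # sentinel used to shut the toy model down
handled = []                      # appended to only by the completion thread


def pin_current_thread_best_effort():
    """Try to confine the caller to one CPU; skip quietly where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        cpu = max(os.sched_getaffinity(0))
        os.sched_setaffinity(0, {cpu})
        return True
    except OSError:
        return False


def object_store_worker():
    """Toy object store: turns each command into exactly one completion."""
    while True:
        command = command_queue.get()
        if command is STOP:
            completion_queue.put(STOP)
            return
        completion_queue.put(("done", command))


def completion_worker():
    """The single thread that processes every completion."""
    pin_current_thread_best_effort()
    while True:
        item = completion_queue.get()
        if item is STOP:
            return
        handled.append(item[1])


def issue_operations(worker_id, count):
    """An operation worker: only issues commands, never handles completions."""
    for index in range(count):
        command_queue.put((worker_id, index))


store = threading.Thread(target=object_store_worker)
completer = threading.Thread(target=completion_worker)
issuers = [threading.Thread(target=issue_operations, args=(w, 5)) for w in range(4)]
for thread in [store, completer, *issuers]:
    thread.start()
for thread in issuers:
    thread.join()
command_queue.put(STOP)
store.join()
completer.join()
```

Because only one thread ever touches the completion path in this sketch, the list `handled` needs no lock, which loosely mirrors the reduction in contention described above.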

In some embodiments, the hardware used for processing input-output completions, instead of being exclusively dedicated to such processing, is used to perform primarily such processing (e.g., also handling other processing tasks, such as processing some input-output operations but devoting more of its processing resources to the processing of input-output completions than to the processing of input-output operations). As used herein, if a piece of hardware (e.g., a processing circuit) is described as performing, or being configured to perform “primarily” a certain type of task (e.g., the processing of input-output completions), it means that the piece of hardware devotes more time (i.e., more processing cycles) to this type of task than to any other type of task. In some embodiments a first core is used primarily for processing input-output operations and it processes at least ten times as many input-output operations as input-output completions, and a second core is used primarily for processing input-output completions, and it processes at least ten times as many input-output completions as input-output operations. In some embodiments, the processing includes receiving a plurality of contiguous requests to perform input-output operations, and, during the handling of these requests, the first core processes primarily input-output operations and the second core processes primarily input-output completions. As used herein, a set of “contiguous IO requests” is the complete set of IO requests received during some interval of time, i.e., it is a set of IO requests that are received consecutively or concurrently.
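The definition of “primarily” and the ten-to-one condition may be expressed as two small checks. The sketch below is illustrative only; the function names, the task labels, and the cycle and count figures are hypothetical.

```python
def primary_task(cycles_by_task):
    """Return the task type given more cycles than any other, or None on a tie."""
    if not cycles_by_task:
        return None
    ranked = sorted(cycles_by_task.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def meets_ten_to_one(primary_count, other_count):
    """True if at least ten times as many primary items were processed."""
    return primary_count >= 10 * other_count


# Hypothetical per-core cycle counts for one interval of time.
first_core_cycles = {"operations": 9_000_000, "completions": 400_000, "other": 100_000}
second_core_cycles = {"operations": 300_000, "completions": 8_500_000}

print(primary_task(first_core_cycles))   # operations
print(primary_task(second_core_cycles))  # completions
print(meets_ten_to_one(1000, 80))        # True
print(meets_ten_to_one(1000, 150))       # False
```

Note that the first check concerns processing cycles while the second concerns counts of operations and completions, so a core may satisfy one without satisfying the other.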

As used herein, “a portion of” something means “at least some of” the thing, and as such may mean less than all of, or all of, the thing. As such, “a portion of” a thing includes the entire thing as a special case, i.e., the entire thing is an example of a portion of the thing. As used herein, when a second number is “within Y %” of a first number, it means that the second number is at least (1−Y/100) times the first number and the second number is at most (1+Y/100) times the first number. As used herein, the term “or” should be interpreted as “and/or”, such that, for example, “A or B” means any one of “A” or “B” or “A and B”.
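The definition of “within Y %” may be written as a simple predicate. This sketch assumes a non-negative first number, since for a negative first number the two bounds as literally stated cannot both be satisfied (for positive Y); the function name `is_within_percent` is hypothetical.

```python
def is_within_percent(second, first, y):
    """True if second lies in [(1 - y/100) * first, (1 + y/100) * first].

    Assumes first >= 0; the bounds follow the definition given in the text.
    """
    lower = (1 - y / 100) * first
    upper = (1 + y / 100) * first
    return lower <= second <= upper


print(is_within_percent(95, 100, 10))   # True
print(is_within_percent(120, 100, 10))  # False
```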

The term “processing circuit” is used herein to mean any combination of hardware, firmware, and software, employed to process data or digital signals. Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured, i.e., hard-wired, to perform that function, or by more general-purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.

As used herein, the term “array” refers to an ordered set of numbers regardless of how stored (e.g., whether stored in consecutive memory locations, or in a linked list). As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being “based on” a second quantity (e.g., a second variable) it means that the second quantity is an input to the method or influences the first quantity, e.g., the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as (e.g., stored at the same location or locations in memory as) the second quantity.

It will be understood that, although the terms “first”, “second”, “third”, etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art.

As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.

It will be understood that when an element or layer is referred to as being “on”, “connected to”, “coupled to”, or “adjacent to” another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being “directly on”, “directly connected to”, “directly coupled to”, or “immediately adjacent to” another element or layer, there are no intervening elements or layers present.

Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of “1.0 to 10.0” or “between 1.0 and 10.0” is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.

One or more embodiments according to the present disclosure may include one or more characteristics of one or more of the following clauses (although embodiments are not limited thereto):

Clause 1. A storage system, comprising: a plurality of object stores; and a plurality of data managers, connected to the object stores, the plurality of data managers comprising a plurality of processing circuits, a first processing circuit of the plurality of processing circuits being configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits being configured to process primarily input-output completions.

Clause 2. The storage system of clause 1, wherein the second processing circuit is configured to execute a single software thread.

Clause 3. The storage system of clause 1, wherein the processing of the input-output operations comprises writing data for data resiliency.

Clause 4. The storage system of clause 3, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing the first data to a second object store of the plurality of object stores.

Clause 5. The storage system of clause 3, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing parity data corresponding to the first data to a second object store of the plurality of object stores.

Clause 6. The storage system of clause 1, wherein the first processing circuit is configured to process at least 10 times as many input-output operations as input-output completions.

Clause 7. The storage system of clause 6, wherein the second processing circuit is configured to process at least 10 times as many input-output completions as input-output operations.

Clause 8. The storage system of clause 1, wherein the first processing circuit is configured to process only input-output operations.

Clause 9. The storage system of clause 1, wherein the second processing circuit is configured to process only input-output completions.

Clause 10. The storage system of clause 9, wherein the first processing circuit is configured to process only input-output operations.

Clause 11. The storage system of clause 1, wherein the first processing circuit is a first core of a first data manager, and the second processing circuit is a second core of the first data manager.

Clause 12. The storage system of clause 1, wherein: a first data manager of the plurality of data managers comprises a plurality of cores; a first subset of the plurality of cores is configured to process primarily input-output operations and a second subset of the plurality of cores is configured to process primarily input-output completions; and the first subset includes at least 10 times as many cores as the second subset.

Clause 13. The storage system of clause 12, wherein the first subset includes at most 100 times as many cores as the second subset.

Clause 14. A method for operating a storage system comprising a plurality of object stores and a plurality of data managers, the data managers being connected to the object stores, the method comprising: receiving a plurality of contiguous requests to perform input-output operations; processing, by a plurality of processing circuits of the data managers, the input-output operations; and processing, by the plurality of processing circuits of the data managers, a plurality of input-output completions corresponding to the input-output operations, wherein:

a first processing circuit of the plurality of processing circuits processes primarily input-output operations, and

a second processing circuit of the plurality of processing circuits processes primarily input-output completions.

Clause 15. The method of clause 14, wherein the second processing circuit is configured to execute a single software thread.

Clause 16. The method of clause 14, wherein the processing of the input-output operations comprises writing data for data resiliency.

Clause 17. The method of clause 16, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing the first data to a second object store of the plurality of object stores.

Clause 18. The method of clause 16, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing parity data corresponding to the first data to a second object store of the plurality of object stores.

Clause 19. A storage system, comprising: means for storing objects; and a plurality of data managers, connected to the means for storing objects, the plurality of data managers comprising a plurality of processing circuits, a first processing circuit of the plurality of processing circuits being configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits being configured to process primarily input-output completions.

Clause 20. The storage system of clause 19, wherein the second processing circuit is configured to execute a single software thread.

Although aspects of some embodiments of a system and method for improving input-output latency, and for improving the consistency of input-output latency, in a storage system, have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a system and method for improving input-output latency, and for improving the consistency of input-output latency, in a storage system, constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.

SECTION II Scaling Read Cache in a Distributed Object Store Through System Wide State

The text in the present “Section II” of the Specification, including any reference numerals or characters and any references to figures, refer and correspond to the FIGS. 4-7 with the label “Section II”, and does not refer or correspond to the text in sections I, III, or IV, nor any of the reference numerals, characters, or figures with the labels on the figure sheets that have the label “Section I”, “Section III”, or “Section IV”. That is, each of the Sections I-IV in the present Specification should be interpreted in the context of the corresponding description in the same section and the figures labeled with the same section, respectively. Notwithstanding the foregoing, however, various aspects and inventive concepts of the various sections may be applied to aspects and inventive concepts of other sections.

FIELD

One or more aspects of embodiments according to the present disclosure relate to data storage, and more particularly to a metadata cache for a distributed object store and a system and method for scaling the node-local metadata read cache in a distributed object store using system wide state.

BACKGROUND

In a distributed object store, metadata access may be relatively burdensome, especially as the distributed object store is scaled to larger sizes.

Thus, there is a need for an improved system and method for managing metadata in a distributed object store.

SUMMARY

According to an embodiment of the present disclosure, there is provided a method for managing metadata in a distributed object store including a plurality of nodes and persistent storage, the persistent storage storing a plurality of objects and metadata for the objects, the method including, during a first time interval: generating, by a first node, a first value for first metadata for a first object of the plurality of objects; storing, by the first node of the plurality of nodes, the first value for the first metadata in a metadata cache of the first node; generating, by the first node, a first value for a first cookie, the first cookie corresponding to the first metadata; storing the first value for the first cookie in a local storage location of the first node; and storing the first value for the first cookie in a cluster-wide storage location.

In some embodiments, the method further includes, during a second time interval following the first time interval: reading a second value for the first cookie from the cluster-wide storage location; reading the first value for the first cookie from the local storage location of the first node; and comparing, by the first node, the second value for the first cookie to the first value for the first cookie.

In some embodiments, the method further includes: determining that the second value equals the first value; and retrieving the first metadata from the metadata cache of the first node.

In some embodiments, the method further includes accessing the first object based on the retrieved first metadata.

In some embodiments, the method further includes during a third time interval following the first time interval and preceding the second time interval: generating, by a second node of the plurality of nodes, a second value for the first metadata; storing, by the second node, the second value for the first metadata in a metadata cache of the second node; generating, by the second node, a second value for the first cookie, different from the first value for the first cookie; storing the second value for the first cookie in a local storage location of the second node; and storing the second value for the first cookie in the cluster-wide storage location.

In some embodiments, the method further includes, during a second time interval following the third time interval: reading the second value for the first cookie from the cluster-wide storage location; reading, by the first node, the first value for the first cookie from the local storage location of the first node; and comparing, by the first node, the second value for the first cookie to the first value for the first cookie.

In some embodiments, the method further includes: determining, by the first node, that the second value for the first cookie does not equal the first value for the first cookie; and retrieving, by the first node, the first metadata from the persistent storage.

In some embodiments, the method further includes accessing the first object based on the retrieved first metadata.

In some embodiments, the method further includes: calculating, by the first node, during the first time interval, a node identifier for a node containing the cluster-wide storage location.

In some embodiments, the calculating of the node identifier includes calculating a hash based on a property of the metadata.

In some embodiments, the first cookie includes: an identifier, identifying the first metadata, and a signature identifying the writing instance.

In some embodiments, the signature includes a time of day.

According to an embodiment of the present disclosure, there is provided a distributed object store, including: a plurality of nodes; and persistent storage, the persistent storage storing a plurality of objects and metadata for the objects, the nodes including a plurality of processing circuits including a first processing circuit in a first node of the plurality of nodes, the first processing circuit being configured, during a first time interval, to: generate a first value for first metadata for a first object of the plurality of objects; store the first value for the first metadata in a metadata cache of the first node; generate a first value for a first cookie, the first cookie corresponding to the first metadata; store the first value for the first cookie in a local storage location of the first node; and store the first value for the first cookie in a cluster-wide storage location.

In some embodiments, the first processing circuit is further configured, during a second time interval following the first time interval, to: read a second value for the first cookie from the cluster-wide storage location; read the first value for the first cookie from the local storage location of the first node; and compare the second value for the first cookie to the first value for the first cookie.

In some embodiments, the first processing circuit is further configured to: determine that the second value equals the first value; and retrieve the first metadata from the metadata cache of the first node.

In some embodiments, the first processing circuit is further configured to access the first object based on the retrieved first metadata.

In some embodiments, a second processing circuit, of a second node of the plurality of nodes, is configured, during a third time interval following the first time interval and preceding the second time interval, to: generate a second value for the first metadata; store the second value for the first metadata in a metadata cache of the second node; generate a second value for the first cookie, different from the first value for the first cookie; store the second value for the first cookie in a local storage location of the second node; and store the second value for the first cookie in the cluster-wide storage location.

In some embodiments, the first processing circuit is further configured, during a second time interval following the third time interval, to: read the second value for the first cookie from the cluster-wide storage location; read the first value for the first cookie from the local storage location of the first node; and compare the second value for the first cookie to the first value for the first cookie.

In some embodiments, the first processing circuit is further configured to: determine that the second value for the first cookie does not equal the first value for the first cookie; and retrieve the first metadata from the persistent storage.

According to an embodiment of the present disclosure, there is provided a distributed object store, including: a plurality of nodes; and persistent storage means, the persistent storage means storing a plurality of objects and metadata for the objects, the nodes including a plurality of processing circuits including a first processing circuit in a first node of the plurality of nodes, the first processing circuit being configured to: generate a first value for first metadata for a first object of the plurality of objects; store the first value for the first metadata in a metadata cache of the first node; generate a first value for a first cookie, the first cookie corresponding to the first metadata; store the first value for the first cookie in a local storage location of the first node; and store the first value for the first cookie in a cluster-wide storage location.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and appended drawings wherein:

FIG. 4 is a block diagram of a storage system, according to an embodiment of the present disclosure;

FIG. 5A is a data layout for a cookie, according to an embodiment of the present disclosure;

FIG. 5B is a data layout for a metadata entry, according to an embodiment of the present disclosure;

FIG. 6 is an access heat graph, according to an embodiment of the present disclosure;

FIG. 7A is a data access sequence diagram, according to an embodiment of the present disclosure;

FIG. 7B is a data access sequence diagram, according to an embodiment of the present disclosure;

FIG. 7C is a data access sequence diagram, according to an embodiment of the present disclosure; and

FIG. 7D is a data access sequence diagram, according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The detailed description set forth below in connection with the appended drawings is intended as a description of example embodiments of a metadata cache for a distributed object store provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.

FIG. 4 shows a storage system, or “cluster”, in some embodiments. Each of a plurality of object stores, or “key value stores” 105 provides persistent storage, through an object store interface. Each of a plurality of data managers (or “nodes”) 110 receives storage requests (e.g., data write, modify, or read requests) from clients such as client applications and workloads, and, for each such request, sends one or more corresponding input-output commands (or “I/Os”, or “IOs”) to one or more of the object stores 105. The object stores 105 may communicate with the data managers 110 through an object store interface. The data managers 110 may communicate with the clients through any suitable interface, e.g., through an object store interface, through a Server Message Block (SMB) interface, or through a Network File System (NFS) (e.g., NFSv3 or NFSv4) interface. The storage system of FIG. 4 may be referred to as a “distributed object store”. Each of the data managers 110 may be a node including one or more processing circuits. Each of the object stores 105 may also be a node, including one or more processing circuits, and further including persistent data storage means, e.g., one or more persistent storage devices such as solid-state drives. The data managers 110 and the object stores 105 may be connected together in a manner allowing each node to communicate with any of the other nodes, and the methods disclosed herein may be performed by one or more of the processing circuits, e.g., by several processing circuits that are in communication with each other. In the following text, unless the context indicates otherwise, a “node” is a data manager 110.

A distributed object store such as that of FIG. 4 may require sophisticated metadata to manage different aspects of each object, such as growth or shrinkage of the object, data management policies, object versioning necessitated by the immutable nature of some of the object stores 105, or obtaining the location of the object data on a data store given an object identifier. The nodes 110 may rely on metadata when accessing the objects in the distributed object store, e.g., a node 110 may first retrieve the metadata for an object before accessing the object.

These aspects of the object may involve multiple metadata accesses in the input-output (IO) path before the appropriate object data can be fetched. As such, it may be necessary for an object store to access the persistent store multiple times for processing the metadata before the object data can be fetched. The performance of a scalable distributed object store may therefore be dependent upon the performance with which the required metadata is accessed, and the performance of the distributed object store may be improved if the distributed object store is able to maintain low latencies for metadata accesses. In some embodiments, low latency access is achieved by caching the metadata to avoid having to access the underlying persistent store for every metadata access.

While caching may improve performance, it may also impose costs. For example, in order to maintain cache coherency across the distributed object store the metadata cache may hold a system wide lock on any data item that is cached. This lock may be part of the system wide state for each entry that is in the cache. In such an embodiment, the system wide state of each entry in the cache ensures that the client application always gets data from the cache that is coherent across the distributed object store.

As usage of the distributed object store by the client application continues, the amount of data that is cached in the metadata cache may increase. This increase of data in the cache may improve performance because the input-output (IO) path of the distributed object store can find the metadata it needs in the cache and hence avoid relatively slow accesses to persistent storage. However, a cost of maintaining cache coherency of this data across the distributed object store may be the increased shared system state (or “shared state”) that the cache needs to maintain for the cached data.

Moreover, the amount of shared state held by the cache in each node may negatively affect the recovery and error handling processing that may be performed to handle failure scenarios in a distributed object store. For example, the distributed object store, when performing error recovery, may need to appropriately handle the shared state that was owned by a node that has gone offline. Components in a distributed object store may go offline for several reasons, such as reasons related to software, hardware, or network failure. As such, the efficiency of the error handling process for these scenarios may impact the consistent performance that a distributed system can provide to the client application. The increase in the amount of shared state held by a node in the distributed object store may increase the time required to perform error handling that may be required to recover from a node failure.

As the distributed object store is scaled to larger sizes (e.g., to include more nodes), it may more frequently be the case that individual nodes go offline either due to failures or due to planned maintenance operations, and error recovery may become the norm rather than the exception. An optimal error recovery approach may therefore be important for enabling scaling in the distributed object store. A challenge, in implementing such an approach, may be to cache large amounts of data in the metadata cache for longer times to improve I/O performance without negatively impacting error recovery processes that may be performed with increasing frequency as the distributed object store scales to larger sizes.

In some embodiments, a solution to the challenge of achieving both scalability and performance in a distributed object store includes three parts. First, the metadata cache intelligently categorizes the entries in the cache into one of two categories: (i) entries in the cache that require the cache to hold a system wide shared state for the data in the entry, and (ii) entries in the cache for which the cache will not hold a system wide shared state for the data in the entry. Second, the metadata cache continues to cache the data for both of the categories of cache entries. The metadata cache may, however, mark for eviction cache entries in the second category. Third, for every entry in the metadata cache, the cache maintains a cookie that includes (e.g., consists of) the following information: (i) the identifier (ID) of the data in the cache, and (ii) a unique signature generated by the cache. As used herein, to “hold shared state” means to hold a system-wide lock on data (corresponding to the shared state) stored in the metadata cache. The system-wide lock may be a shared lock if the node is accessing the data in read-only mode, or an exclusive lock if the node is writing the data. “Dropping” or “releasing” shared state means releasing the system-wide lock.

FIG. 5A shows the data layout of a cookie used by the cache for this purpose, in some embodiments. The cookie created by the metadata cache is used to determine whether the cached data in the cache entry is valid, when the entry is accessed by the node. As used herein, a “cookie” is a data structure. The word “cookie” is used because it may be advantageous for the data structure to be relatively small, e.g., having a size less than 100 kB, but the term “cookie” as used herein is not limited to such small data structures and may refer to a data structure of any size. The cookies may be stateless (not stored in persistent storage) and therefore need not be recovered for error recovery or configuration change purposes. As a result, the handling of error recovery need not be affected by the scale of the cookies, and the absence of cookies need not impact the validity of the cache entries that are being actively used by the client application.

As such, the cookies may not affect the error recovery processes of the distributed object store. When data is cached, the metadata cache may cache the cookie along with data in the cache entry. FIG. 5B shows a cache entry that is caching data.
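The cookie and cache entry described above can be pictured with a minimal Python sketch. The field names (`metadata_id`, `signature`, `active_refs`, `holds_shared_state`) are illustrative assumptions, not taken from FIG. 5A or FIG. 5B; the text only states that the cookie carries an identifier of the cached data and a unique signature, and that the cookie is cached along with the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cookie:
    """In-memory only; never written to persistent storage."""
    metadata_id: str  # identifier (ID) of the data in the cache
    signature: str    # unique signature generated by the cache


@dataclass
class CacheEntry:
    cookie: Cookie                    # node-local value of the cookie
    data: bytes                       # cached metadata value
    active_refs: int = 0              # hypothetical: count of IOs referencing the entry
    holds_shared_state: bool = False  # hypothetical: system-wide lock currently held?
```

Two cookies compare equal only if both the identifier and the signature match, which is the property a validity check on a cache entry would rely on.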

The method used by the metadata cache to determine whether an entry requires the cache to hold shared state across the distributed object store may include the following two aspects. In a first aspect, the cache may keep track of IOs that are actively referencing a cache entry. As long as the entry is referenced by one or more active IOs, the metadata cache may hold shared state (e.g., maintain a system-wide lock) on that entry to keep the data coherent across the distributed object store.

In a second aspect, once the number of active references to a cache entry becomes zero, the entry may be considered inactive. The metadata cache may, however, continue to cache the data based on an internal valuation of the data of the entry. This may allow subsequent accesses to this data to be performed with significantly reduced latency, compared to retrieving the data from persistent storage. The valuation may be used by the metadata cache to determine the likelihood of that entry being referenced by active IOs in the near future. The valuation may be a numerical value based on the frequency of access during a recent time interval, or it may be proportional to a measure of access heat, as discussed in further detail below. During such an inactive period the cache entry is not actively referenced by IOs but the cache continues to hold the data.

A method to cache data without holding the shared state may be practiced as follows. When a cache entry is considered inactive, the metadata cache may release the shared state associated with that entry. The cache may however continue to maintain the data in the cache along with the cookie. The cookie in the cache entry may be used to determine whether the data in the cache is still valid if that cache entry again becomes active due to references from subsequent IOs. This allows the metadata cache to limit the amount of shared state it needs to hold to only the subset of entries in the cache that are in active use.
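The reference-counting behavior described above may be sketched as follows. This is a simplified single-process illustration under stated assumptions: the set `held_locks` is a hypothetical stand-in for system-wide locks held by one node, the class and method names are invented for the example, and validation of the cookie when an entry becomes active again is omitted.

```python
class Entry:
    def __init__(self, data, cookie):
        self.data = data
        self.cookie = cookie
        self.active_refs = 0  # number of IOs currently referencing the entry


class LocalCache:
    def __init__(self):
        self.entries = {}
        self.held_locks = set()  # stand-in for system-wide locks held by this node

    def insert(self, key, data, cookie):
        self.entries[key] = Entry(data, cookie)

    def begin_io(self, key):
        entry = self.entries[key]
        if entry.active_refs == 0:
            self.held_locks.add(key)  # entry becomes active: hold shared state
        entry.active_refs += 1
        return entry

    def end_io(self, key):
        entry = self.entries[key]
        entry.active_refs -= 1
        if entry.active_refs == 0:
            # Entry is now inactive: release shared state, keep data and cookie.
            self.held_locks.discard(key)
```

In this sketch the number of held locks tracks only the entries with at least one active reference, while `entries` may keep growing.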

Thus, in some embodiments, the metadata cache may be able to cache the increasing amount of data it stores without linearly increasing the amount of shared state held by a node in the distributed object store. FIG. 6 shows the lifecycle of the cache entry and the time period during which the shared state can be released while still keeping the data cached. The “access heat” of a metadata cache entry may be calculated by incrementing the access heat value (e.g., incrementing the access heat value by 1) each time the metadata cache entry is accessed, and periodically replacing the access heat value with the product of the access heat value and a number less than 1 (so that, when there are no accesses, the access heat value decays exponentially).
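The access heat calculation described above may be sketched as follows. The class name, the method names, and the decay factor of 0.5 are assumptions chosen for illustration; the text only requires an increment on each access and a periodic multiplication by a number less than 1.

```python
class AccessHeat:
    def __init__(self, decay=0.5):
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must be between 0 and 1 (exclusive)")
        self.decay = decay
        self.value = 0.0

    def on_access(self):
        # Each access to the cache entry increments the heat by 1.
        self.value += 1.0

    def on_tick(self):
        # Called periodically; with no accesses the heat decays exponentially.
        self.value *= self.decay
```

For example, four accesses followed by two periodic ticks with a decay factor of 0.5 leave a heat of 4 × 0.5 × 0.5 = 1.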

The eviction logic may make determinations regarding which entries to evict from the metadata cache based on a heat map (i.e., based on the relative access heats of the entries in the metadata cache). In addition to the access heat, the valuation function may take into account the type of metadata. For example, there may be different types of cache entries in the metadata cache, and the system may specify which types are preferred, over which other types, for being kept in the cache. The valuation function may be employed to implement such a preference, i.e., a priority of one type of data over another type. For example, a file header entry may be a higher priority type than a data range header.
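One possible form of such a valuation function is sketched below. The multiplicative combination of access heat and a per-type weight, the weight values, and the type names `file_header` and `data_range_header` are all assumptions made for illustration; the text does not specify how heat and type priority are combined.

```python
# Hypothetical per-type weights: a larger weight means the type is preferred
# for being kept in the cache.
TYPE_PRIORITY = {"file_header": 2.0, "data_range_header": 1.0}


def valuation(access_heat, entry_type):
    """Combine access heat with a type weight (assumed multiplicative)."""
    return access_heat * TYPE_PRIORITY.get(entry_type, 1.0)


def pick_eviction_victim(entries):
    """entries maps key -> (access_heat, entry_type); returns the lowest-valued key."""
    return min(entries, key=lambda k: valuation(*entries[k]))
```

With these weights, a file header entry with heat 2 (valuation 4) is retained in preference to a data range header entry with heat 3 (valuation 3).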

In some embodiments, the method to determine the validity of data in a metadata cache entry without holding shared state is as follows. When metadata is first accessed in the distributed object store, steps to cache that metadata are as follows. First, the metadata cache may initialize the cookie for the metadata cache entry with a unique identifier that it generates, and which may identify the writing instance (e.g., which may take a different value each time the metadata is written). The unique identifier may, for example, be a combination of the node identifier and (i) the time of day, or (ii) some analogous counter that changes monotonically. A “cluster-wide” value of this cookie is then stored in the memory of one of the nodes in the distributed object store. The node tasked with storing the cookie for this entry may be determined algorithmically based on the object identifier, e.g., using a suitable hash function. A “local”, or “node-local” value of the cookie is also cached in the cache entry itself along with the data as shown in FIG. 5B.
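The selection of the node that stores the cluster-wide value of the cookie, and the generation of a unique signature, may be sketched as follows. The use of SHA-256 with a modulo over a node list, and of a process-local counter in place of the time of day, are assumptions for illustration; the text only calls for a suitable hash function and a node identifier combined with a time of day or a monotonically changing counter.

```python
import hashlib
import itertools

_counter = itertools.count(1)  # stand-in for a monotonically changing counter


def cookie_home_node(object_id, node_ids):
    """Deterministically map an object identifier to one node identifier."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(node_ids)
    return node_ids[index]


def new_signature(node_id):
    """Combine the node identifier with a counter; differs on every write."""
    return f"{node_id}:{next(_counter)}"
```

Because the mapping depends only on the object identifier and the node list, every node computes the same home node without consulting shared state. A plain modulo remaps many identifiers when the node list changes; a deployment sensitive to that might prefer a different hashing scheme.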

If the same metadata is subsequently accessed from another node in the distributed object store for the first time then the metadata cache on that node may do the following. It may first compute the node where the cluster-wide value of the cookie for the entry is stored, or is to be stored. If the cookie does not exist then this is the first access in the distributed object store (a case discussed above). Otherwise the cookie is read and cached along with the data as shown in FIG. 5B. FIG. 7A shows the cache entry cookie initialization, for the above-described scenario, including numbered steps 1 through 7, which may be performed in order.

When the number of active references to a cached entry drops to zero, the metadata cache may release the shared state. FIG. 7B shows the metadata cache still caching the data with the shared state released.

If the entry is being accessed by another node (e.g., a second node) in an exclusive mode, then the metadata cache may do the following. First, it may compute the node identifier of the node where the cluster-wide value of the cookie for this entry is stored. Second, it may generate a new unique identifier for the entry and store it both in the cluster-wide value of the cookie (i.e., in the node for which the node identifier has been calculated) and in the local value of the cookie. The metadata cache may then use the changed value of the cookie for that entry to determine whether the data cached in that entry on that node is valid or stale. FIG. 7C shows the state of the cache after exclusive access by one of the nodes.
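The exclusive-access case may be sketched as follows. Plain dictionaries stand in for the node-local metadata cache, the cluster-wide cookie location, and the persistent store; these, the function name, and the tuple form of the signature are assumptions for illustration, and lock acquisition and the computation of the node holding the cluster-wide value are omitted.

```python
import itertools

_seq = itertools.count(1)  # stand-in for a monotonically changing counter


def exclusive_write(node_id, key, value, node_cache, cluster_cookies, persistent_store):
    """Write new metadata and publish a fresh cookie value for it."""
    signature = (node_id, next(_seq))       # new unique identifier for this write
    persistent_store[key] = value           # update the underlying persistent store
    node_cache[key] = {"cookie": signature, "data": value}  # local cookie value
    cluster_cookies[key] = signature        # cluster-wide cookie value
    return signature
```

After a second node performs such a write, the local cookie value still held by a first node no longer matches the cluster-wide value, which is what later marks the first node's cached copy as stale.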

If an inactive entry becomes active again, the metadata cache may do the following. First, it may acquire the shared state on the entry. Next, it may read the cluster-wide value of the cookie for this entry. It may then verify whether the local value of the cookie matches the cluster-wide value. If there is a mismatch, the metadata cache may conclude that the current data in the local cache is stale, and read the data from the underlying persistent store to refresh the data in the local cache. Otherwise (i.e., if the local value of the cookie matches the cluster-wide value) the cached data is still valid and thus the data in the local cache can be reused without having to read from the underlying persistent store. FIG. 7D shows the mechanism at work in the metadata cache.
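The validity check performed when an inactive entry becomes active again may be sketched as follows. Dictionaries stand in for the local cache, the cluster-wide cookie location, and the persistent store; these and the function name are illustrative assumptions, the cluster-wide cookie is assumed to exist already, and acquisition of the shared state is omitted.

```python
def read_metadata(key, local_cache, cluster_cookies, persistent_store):
    """Return (value, source), where source is "cache" or "persistent"."""
    cluster_cookie = cluster_cookies.get(key)
    entry = local_cache.get(key)
    if entry is not None and cluster_cookie is not None and entry["cookie"] == cluster_cookie:
        # Local and cluster-wide cookie values match: cached data is still valid.
        return entry["data"], "cache"
    # Mismatch (or nothing cached): refresh from the persistent store and
    # adopt the cluster-wide cookie value as the new local value.
    data = persistent_store[key]
    local_cache[key] = {"cookie": cluster_cookie, "data": data}
    return data, "persistent"
```

A matching cookie costs one in-memory read of the cluster-wide value instead of an access to persistent storage; a mismatch costs one refresh, after which subsequent reads hit the cache again.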

For example, during a first time interval, a first node may generate a first value for first metadata for a first object of the plurality of objects stored in the distributed object store. As used herein the term to “generate” a value for first metadata encompasses both (i) generating a new value for the first metadata based on a previous value of the first metadata and (ii) generating a value based on other input, e.g., based on the size of the object, or the time at which the object was modified, or a storage location of the object. The first node may then store the first value for the first metadata in the underlying persistent store and in the metadata cache of the first node (as shown, e.g., in step 2 of FIG. 7A). The first node may also generate a first value for a first cookie, the first cookie corresponding to the first metadata (as shown, e.g., in step 3 of FIG. 7A). The first node may then calculate a node identifier for a node containing the cluster-wide storage location (as shown, e.g., in step 4 of FIG. 7A), and store the first value for the first cookie both (i) in a local storage location of the first node and (ii) in the cluster-wide storage location (as also shown, e.g., in step 4 of FIG. 7A). During the first time interval, the first node may hold shared state (e.g., it may hold a system-wide lock) for the first metadata. At the end of the first time interval the first node may release the system-wide lock (as shown, e.g., in FIG. 7B).

During a second time interval, the first node may read the value for the first cookie (e.g., a second value for the first cookie) from the cluster-wide storage location, read the first value for the first cookie from the local storage location of the first node, and compare the second value for the first cookie to the first value for the first cookie, to assess whether the metadata, corresponding to the first cookie, in the metadata cache of the first node, is stale. If there is a mismatch, i.e., if the first node determines (as shown, e.g., in step 2 of FIG. 7D) that the second value for the first cookie does not equal the first value for the first cookie (as a result, e.g., of activity by a second node during an intervening third time interval, as discussed below), the first node may conclude that the metadata, corresponding to the first cookie, in the metadata cache of the first node is stale, and it may therefore retrieve the first metadata from the persistent storage (as shown, e.g., in step 3 of FIG. 7D). As used herein, “retrieving the first metadata” means retrieving a value for the first metadata. Otherwise (e.g., if the first node determines that the second value equals the first value), the first node may conclude that the metadata, corresponding to the first cookie, in the metadata cache of the first node, is not stale, and the first node may retrieve the first metadata from the metadata cache of the first node (thereby avoiding an access of persistent storage). During the second time interval, the first node may also hold a system-wide lock for the first metadata.

As mentioned above, the behavior of the first node during the second time interval may depend on the actions of another node (e.g., a second node) during an intervening time interval (e.g., a third time interval) between the first time interval and the second time interval. For example, it may be the case that during the third time interval, the second node generates a second value for the first metadata, stores the second value for the first metadata in a metadata cache of the second node and in the underlying persistent store, generates a second value for the first cookie, different from the first value for the first cookie, stores the second value for the first cookie in a local storage location of the second node, and stores the second value for the first cookie in the cluster-wide storage location (as shown, e.g., in FIG. 7C). In such a case the changed value of the first cookie in the cluster-wide storage location alerts the first node, during the second time interval, that the value of the first metadata has been changed (and has, therefore, become stale in the metadata cache of the first node) and the first node may therefore fetch the first metadata from the persistent storage instead of relying on the value in the metadata cache of the first node.

As used herein, “a portion of” something means “at least some of” the thing, and as such may mean less than all of, or all of, the thing. As such, “a portion of” a thing includes the entire thing as a special case, i.e., the entire thing is an example of a portion of the thing. As used herein, when a second number is “within Y %” of a first number, it means that the second number is at least (1−Y/100) times the first number and the second number is at most (1+Y/100) times the first number. As used herein, the term “or” should be interpreted as “and/or”, such that, for example, “A or B” means any one of “A” or “B” or “A and B”.

The term “processing circuit” is used herein to mean any combination of hardware, firmware, and software, employed to process data or digital signals. Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured, i.e., hard-wired, to perform that function, or by more general-purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.

As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being “based on” a second quantity (e.g., a second variable) it means that the second quantity is an input to the method or influences the first quantity, e.g., the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as (e.g., stored at the same location or locations in memory as) the second quantity.

It will be understood that, although the terms “first”, “second”, “third”, etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art.

As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.

It will be understood that when an element or layer is referred to as being “on”, “connected to”, “coupled to”, or “adjacent to” another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being “directly on”, “directly connected to”, “directly coupled to”, or “immediately adjacent to” another element or layer, there are no intervening elements or layers present.

Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of “1.0 to 10.0” or “between 1.0 and 10.0” is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.

One or more embodiments according to the present disclosure may include one or more characteristics of one or more of the following clauses (although embodiments are not limited thereto):

Clause 1. A method for managing metadata in a distributed object store comprising a plurality of nodes and persistent storage, the persistent storage storing a plurality of objects and metadata for the objects, the method comprising, during a first time interval:

generating, by a first node, a first value for first metadata for a first object of the plurality of objects;

storing, by the first node of the plurality of nodes, the first value for the first metadata in a metadata cache of the first node;

generating, by the first node, a first value for a first cookie, the first cookie corresponding to the first metadata;

storing the first value for the first cookie in a local storage location of the first node; and

storing the first value for the first cookie in a cluster-wide storage location.

Clause 2. The method of clause 1, further comprising, during a second time interval following the first time interval:

reading a second value for the first cookie from the cluster-wide storage location;

reading the first value for the first cookie from the local storage location of the first node; and

comparing, by the first node, the second value for the first cookie to the first value for the first cookie.

Clause 3. The method of clause 2, further comprising:

determining that the second value equals the first value; and

retrieving the first metadata from the metadata cache of the first node.

Clause 4. The method of clause 3, further comprising accessing the first object based on the retrieved first metadata.

Clause 5. The method of clause 2, further comprising, during a third time interval following the first time interval and preceding the second time interval:

generating, by a second node of the plurality of nodes, a second value for the first metadata;

storing, by the second node, the second value for the first metadata in a metadata cache of the second node;

generating, by the second node, a second value for the first cookie, different from the first value for the first cookie;

storing the second value for the first cookie in a local storage location of the second node; and

storing the second value for the first cookie in the cluster-wide storage location.

Clause 6. The method of clause 5, further comprising, during a second time interval following the third time interval:

reading the second value for the first cookie from the cluster-wide storage location;

reading, by the first node, the first value for the first cookie from the local storage location of the first node; and

comparing, by the first node, the second value for the first cookie to the first value for the first cookie.

Clause 7. The method of clause 6, further comprising:

determining, by the first node, that the second value for the first cookie does not equal the first value for the first cookie; and

retrieving, by the first node, the first metadata from the persistent storage.

Clause 8. The method of clause 7, further comprising accessing the first object based on the retrieved first metadata.

Clause 9. The method of clause 1, further comprising:

calculating, by the first node, during the first time interval, a node identifier for a node containing the cluster-wide storage location.

Clause 10. The method of clause 9, wherein the calculating of the node identifier comprises calculating a hash based on a property of the metadata.

Clause 11. The method of clause 1, wherein the first cookie comprises:

an identifier, identifying the first metadata, and

a signature identifying the writing instance.

Clause 12. The method of clause 11, wherein the signature comprises a time of day.

Clause 13. A distributed object store, comprising:

a plurality of nodes; and

persistent storage, the persistent storage storing a plurality of objects and metadata for the objects,

the nodes comprising a plurality of processing circuits including a first processing circuit in a first node of the plurality of nodes, the first processing circuit being configured, during a first time interval, to:

generate a first value for first metadata for a first object of the plurality of objects;

store the first value for the first metadata in a metadata cache of the first node;

generate a first value for a first cookie, the first cookie corresponding to the first metadata;

store the first value for the first cookie in a local storage location of the first node; and

store the first value for the first cookie in a cluster-wide storage location.

Clause 14. The distributed object store of clause 13, wherein the first processing circuit is further configured, during a second time interval following the first time interval, to:

read a second value for the first cookie from the cluster-wide storage location;

read the first value for the first cookie from the local storage location of the first node; and

compare the second value for the first cookie to the first value for the first cookie.

Clause 15. The distributed object store of clause 14, wherein the first processing circuit is further configured to:

determine that the second value equals the first value; and

retrieve the first metadata from the metadata cache of the first node.

Clause 16. The distributed object store of clause 15, wherein the first processing circuit is further configured to access the first object based on the retrieved first metadata.

Clause 17. The distributed object store of clause 14, wherein a second processing circuit, of a second node of the plurality of nodes, is configured, during a third time interval following the first time interval and preceding the second time interval, to:

generate a second value for the first metadata;

store the second value for the first metadata in a metadata cache of the second node;

generate a second value for the first cookie, different from the first value for the first cookie;

store the second value for the first cookie in a local storage location of the second node; and

store the second value for the first cookie in the cluster-wide storage location.

Clause 18. The distributed object store of clause 17, wherein the first processing circuit is further configured, during a second time interval following the third time interval, to:

read the second value for the first cookie from the cluster-wide storage location;

read the first value for the first cookie from the local storage location of the first node; and

compare the second value for the first cookie to the first value for the first cookie.

Clause 19. The distributed object store of clause 18, wherein the first processing circuit is further configured to:

determine that the second value for the first cookie does not equal the first value for the first cookie; and

retrieve the first metadata from the persistent storage.

Clause 20. A distributed object store, comprising:

a plurality of nodes; and

persistent storage means, the persistent storage means storing a plurality of objects and metadata for the objects,

the nodes comprising a plurality of processing circuits including a first processing circuit in a first node of the plurality of nodes, the first processing circuit being configured to:

generate a first value for first metadata for a first object of the plurality of objects;

store the first value for the first metadata in a metadata cache of the first node;

generate a first value for a first cookie, the first cookie corresponding to the first metadata;

store the first value for the first cookie in a local storage location of the first node; and

store the first value for the first cookie in a cluster-wide storage location.

Although aspects of some example embodiments of a metadata cache for a distributed object store have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a metadata cache for a distributed object store constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.

SECTION III Object Store Design Optimized for Mutable Data in a Distributed Object Store

The text in the present “Section III” of the Specification, including any reference numerals or characters and any references to figures, refer and correspond to the FIGS. 8-12 with the label “Section III”, and does not refer or correspond to the text in sections I, II, or IV, nor any of the reference numerals, characters, or figures with the labels on the figure sheets that have the label “Section I”, “Section II”, or “Section IV”. That is, each of the Sections I-IV in the present Specification should be interpreted in the context of the corresponding description in the same section and the figures labeled with the same section, respectively. Notwithstanding the foregoing, however, various aspects and inventive concepts of the various sections may be applied to aspects and inventive concepts of other sections.

FIELD

One or more aspects of embodiments according to the present disclosure relate to data storage, and more particularly to a distributed object store with segregated mutable data.

BACKGROUND

An object store may have various advantages, in terms of performance and reliability, over other storage system architectures. However, in some circumstances, modifying data in an object store may incur significant overhead.

Thus, there is a need for an improved system and method for operating an object store.

SUMMARY

According to an embodiment of the present disclosure, there is provided a method for modifying an object in a distributed object store, the distributed object store including one or more immutable object stores and one or more mutable object stores, each of the immutable object stores being configured to store atomic objects having a size equal to an atomic object size, the method including: allocating, for the object, a first region of a mutable object store of the one or more mutable object stores; storing a first portion of the object in the first region; and modifying the first portion of the object to form a modified first portion of the object, wherein the first region has a size larger than the atomic object size.

In some embodiments, the first region has a size larger than 1.5 times the atomic object size.

In some embodiments: the first portion has a size less than the atomic object size; the modifying of the first portion of the object includes adding data to the first portion of the object; and the method further includes: determining that the total size of the modified first portion of the object equals or exceeds the atomic object size; and moving a second portion of the object to an immutable object store of the one or more immutable object stores, the second portion: being a portion of the modified first portion of the object, and having a size equal to the atomic object size.

In some embodiments, the modifying of the first portion of the object includes storing, in the first region, modifications of the object, in a log structured format.

In some embodiments, the method further includes: determining that an elapsed time since a most recent modification was made to the first portion of the object has exceeded a threshold time; and moving the first portion of the object to an immutable object store of the one or more immutable object stores.

In some embodiments, the storing of the first portion of the object in the first region includes moving an atomic object from an immutable object store of the one or more immutable object stores to the first region, the atomic object being the first portion of the object.

In some embodiments, the method further includes receiving the first portion of the object as part of a storage request from a client application.

In some embodiments, the method further includes moving the modified first portion of the object to an immutable object store of the one or more immutable object stores, wherein the moving of the modified first portion of the object to the immutable object store includes transforming the modified first portion of the object.

In some embodiments, the transforming of the modified first portion of the object includes compressing the modified first portion of the object.

In some embodiments, the transforming of the modified first portion of the object includes encrypting the modified first portion of the object.

In some embodiments, the transforming of the modified first portion of the object includes encoding the modified first portion of the object for data resiliency.

According to an embodiment of the present disclosure, there is provided a distributed object store including: one or more immutable object stores; and one or more mutable object stores, the immutable object stores and the mutable object stores including a plurality of processing circuits, the processing circuits being configured to: store, in each of the immutable object stores, atomic objects having a size equal to an atomic object size, allocate, for an object, a first region of a mutable object store of the one or more mutable object stores; store a first portion of the object in the first region; and modify the first portion of the object to form a modified first portion of the object, wherein the first region has a size larger than the atomic object size.

In some embodiments, the first region has a size larger than 1.5 times the atomic object size.

In some embodiments: the first portion has a size less than the atomic object size; the modifying of the first portion of the object includes adding data to the first portion of the object; and the processing circuits are further configured to: determine that the total size of the modified first portion of the object equals or exceeds the atomic object size; and move a second portion of the object to an immutable object store of the one or more immutable object stores, the second portion: being a portion of the modified first portion of the object, and having a size equal to the atomic object size.

In some embodiments, the modifying of the first portion of the object includes storing, in the first region, modifications of the object, in a log structured format.

In some embodiments, the processing circuits are further configured to: determine that an elapsed time since a most recent modification was made to the first portion of the object has exceeded a threshold time; and move the first portion of the object to an immutable object store of the one or more immutable object stores.

In some embodiments, the storing of the first portion of the object in the first region includes moving an atomic object from an immutable object store of the one or more immutable object stores to the first region, the atomic object being the first portion of the object.

In some embodiments, the processing circuits are further configured to receive the first portion of the object as part of a storage request from a client application.

In some embodiments, the processing circuits are further configured to move the modified first portion of the object to an immutable object store of the one or more immutable object stores, wherein the moving of the modified first portion of the object to the immutable object store includes transforming the modified first portion of the object.

According to an embodiment of the present disclosure, there is provided a distributed object store including: one or more immutable object stores; and one or more mutable object stores, the immutable object stores and the mutable object stores including: persistent data storage means, and a plurality of processing circuits, the processing circuits being configured to: store, in each of the immutable object stores, atomic objects having a size equal to an atomic object size, allocate, for an object, a first region of a mutable object store of the one or more mutable object stores; store a first portion of the object in the first region; and modify the first portion of the object to form a modified first portion of the object, wherein the first region has a size larger than the atomic object size.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and appended drawings wherein:

FIG. 8 is a block diagram of a storage system, according to an embodiment of the present disclosure;

FIG. 9 is an activity graph, according to an embodiment of the present disclosure;

FIG. 10 is a command and data flow diagram, according to an embodiment of the present disclosure;

FIG. 11 is a data layout diagram, according to an embodiment of the present disclosure; and

FIG. 12 is a data layout diagram, according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The detailed description set forth below in connection with the appended drawings is intended as a description of some example embodiments of a distributed object store with segregated mutable data provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.

FIG. 8 shows a storage system, in some embodiments. Each of a plurality of object stores, or “key value stores” 105 provides persistent storage, through an object store interface. Each of a plurality of data managers 110 receives storage requests (e.g., data write, modify, or read requests) from clients such as client applications and workloads, and, for each such request, sends one or more corresponding input-output commands to one or more of the object stores 105. The object stores 105 may communicate with the data managers 110 through an object store interface. The data managers 110 may communicate with the clients through any suitable interface, e.g., through an object store interface, through a Server Message Block (SMB) interface, or through a Network File System (NFS) (e.g., NFSv3 or NFSv4) interface. The storage system of FIG. 8 may be referred to as a “distributed object store”. Each of the data managers 110 may be a node including one or more processing circuits. Each of the object stores 105 may also be a node, including one or more processing circuits, and further including persistent data storage means, e.g., one or more persistent storage devices such as solid-state drives. The data managers 110 and the object stores 105 may be connected together in a manner allowing each node to communicate with any of the other nodes, and the methods disclosed herein may be performed by one or more of the processing circuits, e.g., by several processing circuits that are in communication with each other.

Each of the object stores 105 may store data as objects which are accessed by the data managers 110 by providing an object identifier. This access process may differ from that of block or file storage, in which the application is responsible for organizing its logical data either as a set of fixed size blocks or into a structured organization like directories or files. In block or file storage, the application may lay out its data as a series of blocks in a block storage device, or as a directory and sub-directory hierarchy for a file system.

Each of the object stores 105, however, may instead provide a flattened access to data given an object identifier. This data along with the identifier may be referred to as an object. This approach offloads the mapping and data layout from the application onto the object store. Such offloading may be advantageous for applications that need to process large amounts of unstructured data. Such applications may, when using an object store, store their unstructured data as objects in the object store.

This natural advantage provided by an object store may be further enhanced in a distributed object store. Distributed object stores may provide scalable storage that can grow with the needs of the client applications. The performance of these applications may be dependent upon how fast the distributed object store can serve up the corresponding data given an object identifier. The performance of a distributed object store may be determined primarily by two factors: (i) the ability of the object store to provide access to the object data given an object identifier, and (ii) the management of the underlying persistent storage for efficient data layout. As such, the efficiency and performance of a distributed object store depend on the efficiency of these two mechanisms; the efficiency of these two mechanisms, however, may be limited by the mutability aspects of the object.

Objects in an object store may have one or more phases in their lifetime during which the data set they represent is mutable, i.e., an object may become mutable in any of several scenarios during its lifetime. First, when an object is initially being created, the application may provide the data for the object either as a single quantity of data or through a series of several separately delivered portions of the data, via a series of append-like operations. The object data may then be assembled from several portions. Second, an object may become mutable when the application performs application specific operations on the object, which may be part of the application's active working set. Such operations may be, for example, a partial update to a specific portion of the object, or another form of enhancement to the object that may be application specific. Third, an object may similarly become mutable as the distributed object store applies data transformations (or “transformation functions”) like encoding or erasure coding, compression, encryption or other specific functions for enhancing security, efficiency, resiliency and performance. FIG. 9 shows an example of a mutability graph for an object. A first mutable phase 205 occurs when the object is being created. The object then becomes immutable, and, at a later time, during a second mutable phase 210, one or more portions of the object again become mutable, as the application updates those portions of the object.

The mutable phases of the object, resulting from (i) an object being part of an application active working set or (ii) from the object store performing data transformation functions, may introduce several kinds of overhead in the object store. First, the ability to retrieve the data associated with an object given the object identifier may potentially require stitching together multiple data portions from several different distinct updates made to that object. This overhead can potentially impact the performance of the object store. Second, during the course of an object's lifetime different sections of an object may mutate (i.e., be modified). This requires the ability to ensure that the underlying persistent storage is utilized efficiently as data layout changes due to object mutability. Third, especially in an object store based on one or more solid-state drives, it may be advantageous to implement all modification operations as read modify update operations. Such read modify update operations may have an impact on the performance of both the object store as well as the underlying persistent storage medium.
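The stitching overhead mentioned above may be illustrated with a short Python sketch. It assumes, purely for illustration, that each update to an object is recorded as a log entry consisting of a byte offset and a payload; the function name apply_log and the zero-filling of gaps are hypothetical choices and are not taken from the present disclosure.

```python
def apply_log(base: bytes, log: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the current object data by replaying updates in order.

    Each log entry is (offset, payload). Later entries overwrite earlier
    ones; writes past the current end extend the data, zero-filling any gap.
    """
    data = bytearray(base)
    for offset, payload in log:
        end = offset + len(payload)
        if end > len(data):
            data.extend(b"\x00" * (end - len(data)))
        data[offset:end] = payload
    return bytes(data)


# Three distinct updates must be stitched together to serve one read.
current = apply_log(b"hello world", [(0, b"J"), (6, b"W"), (11, b"!")])
```

The cost of a read in such a scheme grows with the number of outstanding updates, which is one reason it may be advantageous to confine objects in a mutable phase to a small, dedicated region.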

At any point in time it may be the case that a small set of objects is in the mutable phase. If the mutable objects of this set are distributed throughout the entire object store, then this situation may degrade the overall performance of the distributed object store. This impact on the performance of the object store may affect the performance of the client applications that are using the distributed object store. As such, it may be advantageous to minimize the performance impact that the mutable data sets have on the entire distributed object store. It may further be advantageous to perform, without degrading the performance of the distributed object store, data preprocessing functions such as compression, encryption, and encoding on the active working set to improve storage efficiency, data resiliency, security and reliability.

In some embodiments, such advantageous characteristics are achieved using a distributed object store with segregated mutable data as discussed in further detail below. Such a system may minimize the impact of mutable data sets on the overall performance of the object store, and allow for the object store to stage the application working set (i.e., to stage the active working set, the data set on which the application is working) in higher performance persistent class memories, such as storage class memories (SCM).

In some embodiments, the distributed object store with segregated mutable data includes a separate store within the distributed object store that is optimized for handling mutable data. This special and independent area in the distributed object store may be referred to as a “mutable store”, as a “mutable object store”, or as a “staging store”. The staging store may be optimized to have high performance in (i) its ability to perform data transformation functions (ii) its ability to handle changes to mutable data by aggregating updates to the object data, and (iii) its ability to maintain consistent performance against multiple mutable operations through efficient data layout that is tuned for handling these operations. As such, the staging store helps the overall performance of the distributed object store by segregating mutable data sets from immutable data sets and localizing the handling of mutable operations on the distributed object store.
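As a minimal sketch of the first of these abilities, the following Python fragment applies a transformation function to data as it leaves the staging store. The function name move_to_immutable, the use of dictionaries as stand-ins for the two stores, and the choice of zlib compression as the transformation are all assumptions made for illustration; encryption or encoding for data resiliency could be substituted for the transform argument.

```python
import zlib
from typing import Callable


def move_to_immutable(
    staging: dict,
    immutable: dict,
    object_id: str,
    transform: Callable[[bytes], bytes] = zlib.compress,
) -> None:
    """Remove an object from the staging store, transform it, and store the
    transformed bytes in the immutable store (both stores are plain dicts here)."""
    data = staging.pop(object_id)
    immutable[object_id] = transform(data)


staging_store = {"obj1": b"abc" * 1000}
immutable_store = {}
move_to_immutable(staging_store, immutable_store, "obj1")
```

Performing the transformation at the moment of transition means the immutable object store only ever sees data in its final, transformed form.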

FIG. 10 illustrates the localization and segregation achieved through the staging store in a distributed object store. When an application 305 sends a storage request for creating or modifying an object, the object is created in, or a portion of the object is moved to and modified in, the staging store 310. Immutable objects are stored in the immutable object store 315. Objects may be stored in the immutable object store 315 in portions referred to as “atomic objects” to facilitate data layout in the immutable object store 315. The size of the atomic objects (which may be referred to as the “unit size” or the “atomic object size”) may be selected to be (i) sufficiently large that the overhead of assembling an object from a number of atomic objects is acceptable and (ii) sufficiently small that the amount of unused storage that results from storing partial atomic objects (e.g., when the size of an object is not an integer multiple of the unit size) is acceptable. In some embodiments, the atomic objects are larger than 100 kB and smaller than 10 MB, e.g., they are 1 MB in size. In some embodiments, the client application may specify the unit size to be used. The staging store 310 includes an object map, for mapping atomic objects to storage addresses (e.g., in a distributed object store based on solid-state drives, the object map may map atomic objects to logical block addresses in the solid-state drives). The staging store 310 may be significantly smaller than the immutable object store 315; for example, the immutable object store 315 may be between 10 times and 100 times as large as the staging store 310. In some embodiments, the only data operations performed by the immutable object stores are the reading, writing, and erasing of entire atomic objects (each such atomic object having a size less than or equal to the unit size).
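The trade-off in selecting the unit size may be made concrete with a small calculation. The Python sketch below assumes, for illustration only, that a partial atomic object occupies a full unit-sized slot, so that the unused storage is the difference between the allocated slots and the object size; the function name atomic_layout and the default unit size of 1,000,000 bytes are hypothetical.

```python
def atomic_layout(object_size: int, unit_size: int = 1_000_000) -> tuple[int, int]:
    """Return (number of atomic objects, unused bytes in the last slot)."""
    if object_size < 0 or unit_size <= 0:
        raise ValueError("object_size must be >= 0 and unit_size must be > 0")
    count = -(-object_size // unit_size)  # ceiling division
    unused = count * unit_size - object_size
    return count, unused


# A 2.5 MB object with a 1 MB unit size needs three atomic objects,
# the last of which is only half full.
layout = atomic_layout(2_500_000)
```

A larger unit size reduces the number of atomic objects to be assembled per object but increases the worst-case unused storage, which approaches one unit per object.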

In operation, three principal steps or features, each discussed in further detail below, may be employed to implement the segregation of mutable data: (i) identifying the mutable data set, (ii) determining when mutable data in the staging store is ready to be transitioned out of the staging store, and (iii) employing an efficient data layout optimized for maintaining consistent performance against continuous updates to the mutating data set.

Identification of the mutable data set may proceed as follows. The distributed object store may define a certain data set size as the “unit size” or the “atomic object size”. The unit size may be configurable and may be programmed by the client application to meet its needs. The programming of this unit size may be explicit through a configuration mechanism, or tuned based on the workload pattern the application is issuing. Any application data object that is bigger than the unit size may be broken up into portions, or “atomic objects” each having a size equal to the unit size, and the atomic objects may be stored individually as multiple units. The distributed object store may, during read operations, serve up the data to the application, upon request, by retrieving all the related atomic objects and returning the data as the one single object that the application expects.
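As a non-limiting illustration, the splitting of an application object into atomic objects, and the reassembly of those atomic objects during a read operation, may be sketched as follows. The function names `split_into_atomic_objects` and `reassemble` are hypothetical and are used here only for illustration; the sketch assumes the object is available in memory as a byte string, and the final portion may be smaller than the unit size.

```python
def split_into_atomic_objects(data: bytes, unit_size: int) -> list:
    """Break an object into portions of at most unit_size bytes.

    Every portion except possibly the last has a size equal to unit_size.
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def reassemble(atomic_objects: list) -> bytes:
    """Return the single object the application expects from its portions."""
    return b"".join(atomic_objects)


# Example: a 2560-byte object with a (hypothetical) unit size of 1024 bytes.
example_object = bytes(range(256)) * 10
portions = split_into_atomic_objects(example_object, 1024)
```

In this sketch, the last portion (512 bytes) is smaller than the unit size, and corresponds to the data that would be treated as part of the mutable data set.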

During write operations, the distributed object store may categorize data that is smaller than the unit size as the mutable data set; the distributed object store may expect that the application is likely to provide additional data related to that object through subsequent writes. In some embodiments, the redirection of the mutable data to the staging store is transparent to the application. The client application can access data for the object, and the fact that the mutable data set for that object is internally being managed by the staging store may be transparent to the client application.

Determining when mutable data in the staging store is ready to be transitioned out of the staging store may proceed as follows. The criteria for this determination may be based on external and internal parameters. One external parameter may be the unit size. The unit size may be tuned to be application specific, allowing the distributed object store to take advantage of the nature of the application. The unit size parameter may be used by the object store to tag an incoming data set as a mutable data set. Internal parameters may allow the staging store to be used efficiently. For example, an internal parameter may be the current size of the mutable data set for an object in the staging store 310. As the application continues to write data into the object, the size of the mutable data set grows.

Storage requests received from the client application may include requests to store new objects, and requests to append to or overwrite portions of existing objects. The staging store may (instead of making updates to an object immediately upon receipt of such requests) store such requests in a log structured format, i.e., as a sequence of write or modify instructions. For example, if the client application (i) requests that a new object be stored, and sends, to the distributed object store, the first 100 kB of the object, and then (ii) requests that a 10 kB block in the middle of the object be replaced with new data (e.g., with a new 10 kB block sent by the client application), then the distributed object store may, instead of immediately overwriting the 10 kB block in the middle of the object with the new data, store the second request and leave the initially stored 100 kB block undisturbed. Once the object is ready to be transitioned to the immutable object store 315, the distributed object store may create, in the immutable object store 315, an object that incorporates all of the storage requests received from the client application.
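As a non-limiting illustration, the log structured handling of storage requests may be sketched as follows. The names `WriteRecord`, `StagedObject`, `write`, and `materialize` are hypothetical; the sketch assumes that each request is an (offset, data) pair, and that the object is assembled by replaying the requests in order only when it is ready to be transitioned.

```python
from dataclasses import dataclass, field


@dataclass
class WriteRecord:
    offset: int   # byte offset within the object
    data: bytes   # data to be written at that offset


@dataclass
class StagedObject:
    log: list = field(default_factory=list)

    def write(self, offset: int, data: bytes) -> None:
        """Record the request; previously stored data is left undisturbed."""
        self.log.append(WriteRecord(offset, bytes(data)))

    def materialize(self) -> bytes:
        """Replay the log in order to build the object to be transitioned."""
        buf = bytearray()
        for rec in self.log:
            end = rec.offset + len(rec.data)
            if end > len(buf):
                buf.extend(b"\x00" * (end - len(buf)))  # grow to fit
            buf[rec.offset:end] = rec.data
        return bytes(buf)


# Example mirroring the text: 100 kB stored, then a 10 kB block replaced.
obj = StagedObject()
obj.write(0, b"a" * 100_000)
obj.write(45_000, b"b" * 10_000)
result = obj.materialize()
```

Later records take precedence over earlier ones where they overlap, because the log is replayed in the order in which the requests were received.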

Once the size of the mutable data set for the object reaches the configured unit size, the staging store may determine that the mutable data set is ready for data transformation. For example, if the unit size is 1 MB (1 megabyte), and an object having a total size of 10 MB is to be written, then if the application initially sends a 3.5 MB portion of data for the object, the distributed object store may (i) store three 1 MB atomic objects in the immutable object store 315 and (ii) store the remaining 0.5 MB of data in the staging store 310, anticipating that the application will be sending additional data for the object. Once the size of the portion in the staging store 310 is equal to the unit size, the distributed object store may move the portion to the immutable object store 315 as an additional atomic object, and begin creating the next atomic object in the staging store 310.
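As a non-limiting illustration, the arithmetic of the example above may be sketched as follows. The function name `place_incoming_data` is hypothetical, and the sketch assumes, for simplicity, that 1 MB is 1,000,000 bytes and that sizes are tracked in bytes.

```python
def place_incoming_data(staged_bytes: int, incoming_bytes: int, unit_size: int):
    """Return (full_atomic_objects, remaining_staged_bytes).

    full_atomic_objects is the number of unit-sized atomic objects that can
    be moved to the immutable object store; remaining_staged_bytes is the
    amount of data that stays in the staging store.
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    return divmod(staged_bytes + incoming_bytes, unit_size)


MB = 1_000_000
# The application first sends 3.5 MB: three atomic objects, 0.5 MB staged.
first = place_incoming_data(0, 3_500_000, MB)
# A further 0.5 MB completes the staged portion, which becomes a fourth atomic object.
second = place_incoming_data(first[1], 500_000, MB)
```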

Another internal parameter that may be used is a threshold time, which may operate as the longest time (since the most recent modification was made to the mutable data set of an object in the staging store) that may be spent by the mutable data set of an object in the staging store. This threshold time parameter may be configured to be application specific. This parameter specifies a time for the application to perform subsequent IOs. This mechanism allows the staging store to classify as immutable some objects whose sizes are less than the configured unit size. This parameter may be combined with the unit size parameter, so that once the size of an object in the staging store 310 reaches the unit size, the object is moved (as an atomic object) into the immutable object store 315, even if the time since the last modification is less than the threshold time. This process may have the effect that the mutable data set for such objects does not exceed the configured unit size.

When the time spent in the staging store by the mutable data set without being modified by the client application exceeds the threshold time, the staging store determines that the data is ready for data transformation. The staging store may then perform any required data preprocessing, such as compression, encryption, or encoding (e.g., encoding for data resiliency, e.g., using an erasure code or data replication), and transition the data set out of the staging store.
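As a non-limiting illustration, the combination of the unit size parameter and the threshold time parameter may be sketched as follows. The function name `ready_for_transition` and its parameters are hypothetical; times are assumed to be expressed in seconds.

```python
def ready_for_transition(staged_size: int, unit_size: int,
                         last_modified: float, now: float,
                         threshold_time: float) -> bool:
    """Decide whether a mutable data set may leave the staging store.

    The data set is ready when its size has reached the unit size, even if
    it was modified recently, or when the time since its most recent
    modification exceeds the threshold time, even if it is smaller than
    the unit size.
    """
    if staged_size >= unit_size:
        return True
    return (now - last_modified) > threshold_time
```

For example, with a unit size of 1,000,000 bytes and a threshold time of 30 seconds, a 400,000-byte data set last modified 45 seconds ago would be ready, whereas the same data set last modified 5 seconds ago would not.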

An efficient data layout optimized for maintaining consistent performance against continuous updates to the mutating data set may be implemented as follows. As discussed above, the staging store may transition data sets out of the staging store 310 once each data set is marked ready for data transformation. This active management of the data set in the staging store allows the staging store to have an efficient data layout that may eliminate the overhead of the background storage management functions of garbage collection and re-laying of data on the storage. Storage management by the staging store may be tuned for mutable data sets that are expected to grow, shrink, or be updated. In some embodiments, the staging store 310 partitions the underlying storage into regions (which may be referred to as “pre-allocated regions” or “chunks”), e.g., regions each having a size of 2 MB, that are larger than (e.g., twice the size of) the unit size, and that can accommodate the configured unit size in the object store. Every new mutable data set that is redirected to the staging store may be allocated an entire region for its storage irrespective of the current size of that data set. For example, a mutable set may be allocated 1 MB or 2 MB even if it currently only occupies 0.5 MB of contiguous storage in logical block address (LBA) space. When the size, in the staging store 310, of a mutable data set that is growing in size exceeds the unit size, the staging store 310 may transition, as mentioned above, a portion of the mutable data set to the immutable object store 315, where the portion has a size equal to the unit size. During this process of transitioning, the mutable data set may continue to grow. The allocation of a region having a size greater than the unit size may make it possible, in such a situation, for the mutable data set to continue growing after its size has exceeded the unit size. As such, in some embodiments, the mutable data set can grow and shrink as it mutates without requiring active storage management.
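As a non-limiting illustration, the allocation of pre-allocated regions may be sketched as follows. The class name `StagingRegions` and its methods are hypothetical; the sketch assumes that regions are identified by their starting offsets in the underlying storage and that each mutable data set is identified by an object identifier.

```python
class StagingRegions:
    """Partition staging storage into equal regions larger than the unit size."""

    def __init__(self, capacity: int, region_size: int, unit_size: int):
        if region_size <= unit_size:
            raise ValueError("region_size must be larger than unit_size")
        self.region_size = region_size
        # Starting offsets of all regions that fit entirely in the capacity.
        self.free = list(range(0, capacity - region_size + 1, region_size))
        self.by_object = {}  # object identifier -> region starting offset

    def allocate(self, object_id: str) -> int:
        """Give a mutable data set an entire region, whatever its current size."""
        if object_id in self.by_object:
            return self.by_object[object_id]
        if not self.free:
            raise MemoryError("no free region in the staging store")
        start = self.free.pop(0)
        self.by_object[object_id] = start
        return start

    def release(self, object_id: str) -> None:
        """Return the region once the data set has been transitioned out."""
        self.free.append(self.by_object.pop(object_id))


MIB = 1024 * 1024
regions = StagingRegions(capacity=8 * MIB, region_size=2 * MIB, unit_size=1 * MIB)
start_a = regions.allocate("object-a")
start_b = regions.allocate("object-b")
```

Because each data set owns a whole region, it can be located directly from its identifier and can grow within the region without being relocated.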

As such, in some embodiments a pre-allocated region may be allocated for an object to be modified, and the object may be modified, in the distributed object store, either (i) when it is first created (in which case a first portion of the object may be received as part of a storage request from a client application and stored in a pre-allocated region 405 in the staging store 310) or (ii) after being stored in an immutable object store 315, in response to a request, from a client application, to modify the object (in which case a first portion of the object may be moved to a pre-allocated region 405 in the staging store 310). The modifying of the object may include determining that the total size of the modified portion of the object equals or exceeds the atomic object size; and moving a second portion of the object to an immutable object store, the second portion being a portion of the modified first portion of the object, and the second portion having a size equal to the atomic object size.

FIG. 11 shows a data layout format that allows the staging store to efficiently handle data mutability. Each pre-allocated region 405 is larger, as shown, than the mutable data set it stores, allowing the mutable data set to grow without becoming fragmented. FIG. 12 shows the simplicity possible with the data layout scheme of FIG. 11, in the ability to access the mutable data set based on its identifier even after the data set has undergone several mutations, which may have caused it to grow or shrink. As mentioned above, this storage layout may include and leverage persistent class memory technologies as the underlying persistent storage for the staging store 310. Such persistent class memory technologies may provide performance and byte addressability that (i) make possible significantly higher performance for the client application, for processing the active working set, and that (ii) decrease the overall wear on the underlying storage in the object store.

As used herein, “a portion of” something means “at least some of” the thing, and as such may mean less than all of, or all of, the thing. As such, “a portion of” a thing includes the entire thing as a special case, i.e., the entire thing is an example of a portion of the thing. As used herein, when a second number is “within Y %” of a first number, it means that the second number is at least (1−Y/100) times the first number and the second number is at most (1+Y/100) times the first number. As used herein, the term “or” should be interpreted as “and/or”, such that, for example, “A or B” means any one of “A” or “B” or “A and B”.

The term “processing circuit” is used herein to mean any combination of hardware, firmware, and software, employed to process data or digital signals. Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured, i.e., hard-wired, to perform that function, or by more general-purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.

As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being “based on” a second quantity (e.g., a second variable) it means that the second quantity is an input to the method or influences the first quantity, e.g., the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as (e.g., stored at the same location or locations in memory as) the second quantity.

It will be understood that, although the terms “first”, “second”, “third”, etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art.

As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.

It will be understood that when an element or layer is referred to as being “on”, “connected to”, “coupled to”, or “adjacent to” another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being “directly on”, “directly connected to”, “directly coupled to”, or “immediately adjacent to” another element or layer, there are no intervening elements or layers present.

Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of “1.0 to 10.0” or “between 1.0 and 10.0” is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.

One or more embodiments according to the present disclosure may include one or more characteristics of one or more of the following clauses (although embodiments are not limited thereto):

Clause 1. A method for modifying an object in a distributed object store, the distributed object store comprising one or more immutable object stores and one or more mutable object stores, each of the immutable object stores being configured to store atomic objects having a size equal to an atomic object size, the method comprising:

allocating, for the object, a first region of a mutable object store of the one or more mutable object stores;

storing a first portion of the object in the first region; and

modifying the first portion of the object to form a modified first portion of the object,

wherein the first region has a size larger than the atomic object size.

Clause 2. The method of clause 1, wherein the first region has a size larger than 1.5 times the atomic object size.

Clause 3. The method of clause 1, wherein:

the first portion has a size less than the atomic object size;

the modifying of the first portion of the object comprises adding data to the first portion of the object; and

the method further comprises:

- determining that the total size of the modified first portion of the object equals or exceeds the atomic object size; and
- moving a second portion of the object to an immutable object store of the one or more immutable object stores, the second portion:
  - being a portion of the modified first portion of the object, and
  - having a size equal to the atomic object size.

Clause 4. The method of clause 1, wherein the modifying of the first portion of the object comprises storing, in the first region, modifications of the object, in a log structured format.

Clause 5. The method of clause 1, further comprising:

determining that an elapsed time since a most recent modification was made to the first portion of the object has exceeded a threshold time; and

moving the first portion of the object to an immutable object store of the one or more immutable object stores.

Clause 6. The method of clause 1, wherein the storing of the first portion of the object in the first region comprises moving an atomic object from an immutable object store of the one or more immutable object stores to the first region, the atomic object being the first portion of the object.

Clause 7. The method of clause 1, further comprising receiving the first portion of the object as part of a storage request from a client application.

Clause 8. The method of clause 1, further comprising moving the modified first portion of the object to an immutable object store of the one or more immutable object stores,

wherein the moving of the modified first portion of the object to the immutable object store comprises transforming the modified first portion of the object.

Clause 9. The method of clause 8, wherein the transforming of the modified first portion of the object comprises compressing the modified first portion of the object.

Clause 10. The method of clause 8, wherein the transforming of the modified first portion of the object comprises encrypting the modified first portion of the object.

Clause 11. The method of clause 8, wherein the transforming of the modified first portion of the object comprises encoding the modified first portion of the object for data resiliency.

Clause 12. A distributed object store comprising:

one or more immutable object stores; and

one or more mutable object stores,

the immutable object stores and the mutable object stores comprising a plurality of processing circuits, the processing circuits being configured to:

- store, in each of the immutable object stores, atomic objects having a size equal to an atomic object size,
- allocate, for an object, a first region of a mutable object store of the one or more mutable object stores;
- store a first portion of the object in the first region; and
- modify the first portion of the object to form a modified first portion of the object,

wherein the first region has a size larger than the atomic object size.

Clause 13. The distributed object store of clause 12, wherein the first region has a size larger than 1.5 times the atomic object size.

Clause 14. The distributed object store of clause 12, wherein:

the first portion has a size less than the atomic object size;

the modifying of the first portion of the object comprises adding data to the first portion of the object; and

the processing circuits are further configured to:

- determine that the total size of the modified first portion of the object equals or exceeds the atomic object size; and
- move a second portion of the object to an immutable object store of the one or more immutable object stores, the second portion:
  - being a portion of the modified first portion of the object, and
  - having a size equal to the atomic object size.

Clause 15. The distributed object store of clause 12, wherein the modifying of the first portion of the object comprises storing, in the first region, modifications of the object, in a log structured format.

Clause 16. The distributed object store of clause 12, wherein the processing circuits are further configured to:

determine that an elapsed time since a most recent modification was made to the first portion of the object has exceeded a threshold time; and

move the first portion of the object to an immutable object store of the one or more immutable object stores.

Clause 17. The distributed object store of clause 12, wherein the storing of the first portion of the object in the first region comprises moving an atomic object from an immutable object store of the one or more immutable object stores to the first region, the atomic object being the first portion of the object.

Clause 18. The distributed object store of clause 12, wherein the processing circuits are further configured to receive the first portion of the object as part of a storage request from a client application.

Clause 19. The distributed object store of clause 12, wherein the processing circuits are further configured to move the modified first portion of the object to an immutable object store of the one or more immutable object stores, wherein the moving of the modified first portion of the object to the immutable object store comprises transforming the modified first portion of the object.

Clause 20. A distributed object store comprising:

one or more immutable object stores; and

one or more mutable object stores,

the immutable object stores and the mutable object stores comprising:

- persistent data storage means, and
- a plurality of processing circuits,

the processing circuits being configured to:

- store, in each of the immutable object stores, atomic objects having a size equal to an atomic object size,
- allocate, for an object, a first region of a mutable object store of the one or more mutable object stores;
- store a first portion of the object in the first region; and
- modify the first portion of the object to form a modified first portion of the object,

wherein the first region has a size larger than the atomic object size.

Although aspects of some embodiments of a distributed object store with segregated mutable data have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a distributed object store with segregated mutable data constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.

SECTION IV Methods and Systems for Elastically Storing Objects in an Object Storage System

The text in the present “Section IV” of the Specification, including any reference numerals or characters and any references to figures, refer and correspond to the FIGS. 13-16 with the label “Section IV”, and does not refer or correspond to the text in sections I-III, nor any of the reference numerals, characters, or figures with the labels on the figure sheets that have the label “Section I”, “Section II”, or “Section III”. That is, each of the Sections I-IV in the present Specification should be interpreted in the context of the corresponding description in the same section and the figures labeled with the same section, respectively. Notwithstanding the foregoing, however, various aspects and inventive concepts of the various sections may be applied to aspects and inventive concepts of other sections.

FIELD

The present application generally relates to object storage systems, and more particularly to methods and systems for elastically storing objects in an object storage system.

BACKGROUND

Some storage systems use a form of data persistence called block storage. This traditional design maps easily onto certain disk and flash media devices. In such a system, a unit of storage is the block, which is a fixed length sequence of bytes whose attribute is its address (that is, its location on a disk or flash device). A file system may be constructed using block storage by creating special blocks that contain information known as metadata about the use and location of other blocks that contain the data that users are interested in. The metadata includes things like the name given to the data by the user, as well as lists of block addresses where the data can be found.

Some applications, such as distributed applications that have multiple nodes working on a single namespace, process large amounts of unstructured data. In such cases, it may be more desirable to use object storage, which is accessed by the application through the use of an object identifier. Unlike applications using block or file storage systems, applications accessing object storage need not be responsible for organizing their logical data as a set of fixed size blocks or structured organizations like directories or files; object storage thereby provides flat access to data given an object identifier.

SUMMARY

According to an embodiment, a method of object storage on an object storage system is described. The method may include: dividing, by an extent manager, a memory device of the object storage system into a plurality of equal sized extents; identifying, by the extent manager, a first characteristic of a first data store of the object storage system comprising a first object; allocating, by the extent manager, a first extent corresponding to the first characteristic of the first object from the plurality of extents, to the first data store; storing, by the extent manager, the first object to the allocated first extent; identifying, by the extent manager, a second characteristic of a second data store of the object storage system comprising a second object; allocating, by the extent manager, a second extent corresponding to the second characteristic of the second object from the plurality of extents, to the second data store, wherein the second characteristic is different from the first characteristic; and storing, by the extent manager, the second object to the allocated second extent.

The method may further include retrieving, by the extent manager, status information of each of the plurality of extents from a superblock on the memory device.

The status information may include a state of each extent, a characteristic of each extent, and a location of each extent on the memory device.

The state of each extent may be selected from a free state that is devoid of objects and is available for allocation, an active state that is currently allocated, or a closed state that is unavailable for allocation.

The method may further include: updating, by the extent manager, the status information of the first extent in response to allocating the first extent to the first data store; and updating, by the extent manager, the status information of the second extent in response to allocating the second extent to the second data store.

The method may further include updating, by the extent manager, the state information corresponding to the first extent to a closed state in response to the first extent being filled to capacity with objects.

The superblock may include a table in a reserved portion on the memory device, wherein the table corresponds to each of the plurality of extents.

The characteristic of each extent may be selected from an immutable data extent, a mutable data extent, a metadata extent, or a staging extent.

The memory device may be a solid state drive (SSD).

The object storage system may include a plurality of SSDs connected in parallel.
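As a non-limiting illustration, the method summarized above may be sketched as follows. The names `ExtentManager`, `ExtentStatus`, `State`, and `Kind` are hypothetical, and the in-memory table merely stands in for the status information kept in the superblock; the on-device format is not specified here and is not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    FREE = "free"      # devoid of objects, available for allocation
    ACTIVE = "active"  # currently allocated
    CLOSED = "closed"  # unavailable for allocation


class Kind(Enum):
    IMMUTABLE_DATA = "immutable data"
    MUTABLE_DATA = "mutable data"
    METADATA = "metadata"
    STAGING = "staging"


@dataclass
class ExtentStatus:
    offset: int                   # location of the extent on the device
    state: State = State.FREE
    kind: Optional[Kind] = None   # characteristic, set on allocation
    used: int = 0                 # bytes currently occupied by objects


class ExtentManager:
    def __init__(self, device_size: int, extent_size: int):
        self.extent_size = extent_size
        # Divide the device into equal sized extents, one table entry each.
        self.table = [ExtentStatus(offset=i * extent_size)
                      for i in range(device_size // extent_size)]

    def allocate(self, kind: Kind) -> ExtentStatus:
        """Allocate a free extent to a data store with the given characteristic."""
        for status in self.table:
            if status.state is State.FREE:
                status.state, status.kind = State.ACTIVE, kind
                return status
        raise RuntimeError("no free extent available")

    def store(self, status: ExtentStatus, object_size: int) -> None:
        """Account for an object stored in the extent; close the extent when full."""
        if status.state is not State.ACTIVE:
            raise ValueError("extent is not active")
        if status.used + object_size > self.extent_size:
            raise ValueError("object does not fit in the extent")
        status.used += object_size
        if status.used == self.extent_size:
            status.state = State.CLOSED
```

In this sketch, two data stores with different characteristics (for example, a metadata store and an immutable data store) would each call `allocate` with their own `Kind` and receive distinct extents from the same pool.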

According to another embodiment, an object storage system may be configured to store objects, and the object storage system may include a plurality of data stores, a memory device, and an extent manager. The extent manager may be configured to: divide the memory device of the object storage system into a plurality of equal sized extents; identify a first characteristic of a first data store of the plurality of data stores comprising a first object; allocate a first extent of the plurality of extents corresponding to the first characteristic of the first object, to the first data store; store the first object to the allocated first extent; identify a second characteristic of a second data store of the plurality of data stores comprising a second object; allocate a second extent of the plurality of extents corresponding to the second characteristic of the second object, to the second data store, wherein the second characteristic is different from the first characteristic; and store the second object to the allocated second extent.

The extent manager may be further configured to retrieve status information of each of the plurality of extents from a superblock on the memory device.

The status information may include a state of each extent, a characteristic of each extent, and a location of each extent on the memory device.

The state of each extent may be selected from a free state that is devoid of objects and is available for allocation, an active state that is currently allocated, or a closed state that is unavailable for allocation.

The extent manager may be further configured to: update the status information of the first extent in response to allocating the first extent to the first data store; and update the status information of the second extent in response to allocating the second extent to the second data store.

The extent manager may be further configured to update the state information corresponding to the first extent to a closed state in response to the first extent being filled to capacity with objects.

The superblock may include a table in a reserved portion on the memory device, wherein the table corresponds to each of the plurality of extents.

The characteristic of each extent may be selected from an immutable data extent, a mutable data extent, a metadata extent, or a staging extent.

The memory device may be a solid state drive (SSD).

The object storage system may include a plurality of SSDs connected in parallel.

Accordingly, the scope of the invention is defined by the claims, which are incorporated into this section by reference. A more complete understanding of embodiments of the invention will be afforded to those skilled in the art, as well as a realization of additional advantages thereof, by a consideration of the following detailed description of one or more embodiments. Reference will be made to the appended sheets of drawings that will first be described briefly.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 13 is a block diagram of an example object storage system, according to an embodiment of the present disclosure.

FIG. 14 illustrates an example of a solid state drive (SSD) that is divided into a plurality of equal sized extents, according to an embodiment of the present disclosure.

FIG. 15 is a block diagram of information flow in the object storage system, according to an embodiment of the present disclosure.

FIG. 16 illustrates an example of the contents of a superblock, according to an embodiment of the present disclosure.

FIG. 17 is a flow chart of a method for object storage in an object storage system, according to an embodiment of the present disclosure.

Embodiments of the present disclosure and their advantages are best understood by referring to the detailed description that follows. Unless otherwise noted, like reference numerals denote like elements throughout the attached drawings and the written description, and thus, descriptions thereof will not be repeated. In the drawings, the relative sizes of elements, layers, and regions may be exaggerated for clarity.

DETAILED DESCRIPTION

Hereinafter, example embodiments will be described in more detail with reference to the accompanying drawings. The present invention, however, may be embodied in various different forms, and should not be construed as being limited to only the illustrated embodiments herein. Rather, these embodiments are provided as examples so that this disclosure will be thorough and complete, and will fully convey the aspects and features of the present invention to those skilled in the art. Accordingly, processes, elements, and techniques that are not necessary to those having ordinary skill in the art for a complete understanding of the aspects and features of the present invention may not be described.

Traditional techniques for storing data include using block or file storage systems that rely on some structured organization so that an application that uses the data may easily find and retrieve the stored data. Such structured organization may include fixed size blocks with the data being stored in directories or files. Object storage is another technique for storing data that does not rely on structured organization like in the block or file storage systems. Instead, an object store (also referred to as “key value store (KVS)” or “data store” and used interchangeably herein) utilizes a so-called “key” and a “value.” That is, the data is stored as an object (or “value”) and the object has an associated identifier (or “key”) that specifies the specific location of the object in the storage. Thus, when data is stored, the object is given a key and the object is placed in the storage in an unstructured manner. When it is desired to retrieve the data, the key provides the specific location of the object and therefore the object may be retrieved from storage. Applications that use large amounts of unstructured data in distributed systems may benefit from object storage systems.
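
By way of illustration only, and not of limitation, the key and value relationship described above may be sketched as follows. The class name TinyObjectStore, its methods, and the use of an in-memory buffer in place of a storage device are hypothetical and are not part of the disclosure.

```python
class TinyObjectStore:
    """Toy object store: values are appended to one flat buffer with no
    directory structure, and each key maps to the location of its value."""

    def __init__(self):
        self._media = bytearray()  # stands in for the storage device (assumption)
        self._index = {}           # key -> (offset, length)

    def put(self, key, value: bytes):
        offset = len(self._media)
        self._media += value
        self._index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        # The key provides the location, so no search of the media is needed.
        offset, length = self._index[key]
        return bytes(self._media[offset:offset + length])


store = TinyObjectStore()
store.put("obj-1", b"hello")
store.put("obj-2", b"world!")
```

In this sketch, retrieving an object requires only a lookup of the key followed by a single read at the recorded location.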

FIG. 13 is a block diagram of an example object storage system, according to an embodiment. The object storage system 100 may be a parallel computer that includes at least a data manager node and a key value store (KVS) node. The KVS node may include one or more solid state drives (SSD) connected thereto, which are used to store data as objects. More particularly, this example illustrates four data manager nodes and four KVS nodes. Each one of the data manager nodes 102-105 is connected to each one of the KVS nodes 106-109 as illustrated, and therefore any data manager node 102-105 is able to store objects to an SSD in any of the KVS nodes 106-109, and any data manager node 102-105 is able to retrieve objects from an SSD in any of the KVS nodes 106-109. Thus, the data platform uses a shared storage concept where all available storage from all of the SSDs in the KVS nodes can be used by any of the data manager nodes. As more storage is needed or desired, additional KVS nodes with SSDs can be added, and similarly, when more compute is needed or desired, additional data manager nodes can be added. Accordingly, KVS nodes and data manager nodes in the object storage system 100 according to the embodiments of the present disclosure are independently scalable.

In some embodiments, an application 101 may communicate with the data manager nodes 102-105 to write and read information to the SSDs. By way of example and not of limitation, the application can be any Network File System (NFS) based application such as, for example, NFS v3, NFS v4, SMB, or S3 objects. The application 101 may communicate with the data manager nodes 102-105, and the data manager nodes 102-105 may direct objects from the application 101 to specific KVS nodes based on various parameters of a data store, such as the workload, characteristics, and other attributes. Once the KVS node is selected, the objects may be stored in the SSD of the selected KVS node.

In some embodiments, the performance of distributed applications is dependent on the efficiency of the distributed object store serving the data given an object identifier. In order to provide consistent latency and performance, the object storage systems use a shared storage concept where the plurality of data manager nodes are responsible for serving the application objects while the underlying KVS nodes host the data stores and export the objects to the data manager nodes as shared storage. In some embodiments, it is the responsibility of the KVS nodes to efficiently store different classes of objects to the underlying data platform. In doing so, as client applications write objects to the KVS nodes, each object includes corresponding external metadata associated with the object. Similarly, the data manager nodes generate internal metadata to keep track of the objects as they are stored in the KVS nodes. In some embodiments, internal metadata can include, for example, the location where the object is written, the size of the object, the number of slices of the object, etc.

In some embodiments, the characteristics of metadata are different from those of immutable data or staging data. For example, metadata may be sequential in nature, whereas immutable data may be more random in nature. Thus, in some systems, metadata is stored separately from the immutable data, for example, in a metadata server, which can add to cost and space of the system. However, the use of a separate metadata server may be avoided by co-locating both data and metadata on the same underlying SSD according to various embodiments of the present disclosure. That is, a certain percentage of space on the SSD may be reserved for the metadata while the remaining portions of the SSD may be used for the immutable data or staging data. Moreover, by carving out a portion of the SSD, the size of the metadata can proportionally grow as the size of the data grows without additional provisioning as the distributed object store scales out.

In some embodiments, every SSD may include a plurality of data stores. For example, the SSD may have three data stores: a mutable data store, an immutable data store, and a metadata store. However, the SSD is not necessarily limited to just three stores. In other embodiments, the SSD may have more or fewer stores, for example, two or four or five stores, based on the needs of the application.

In some embodiments, each store may have different characteristics. For example, one store may have the characteristics of a staging store, whereas another store may have the characteristics of an immutable data store, while yet another store may have the characteristics of a metadata store.

FIG. 14 illustrates an example of an SSD that is divided into a plurality of equal sized extents. Instead of statically dividing an SSD into different predefined sizes for each different object store, the SSD may be divided into equal sized chunks called extents. According to various embodiments, an extent is a unit of space allocation of the SSD to a particular object store in the SSD where the space is managed. An extent may be configured for the different types of stores, including a mutable data store, an immutable data store, or a metadata store.

In some embodiments, the SSD may be divided into the extents by object storage system software that runs on the KVS node and formats the SSD. During the formatting, a portion of the SSD may be set aside as a reserve, e.g., 64 GB, and the rest of the SSD may be subdivided into equal size extents, e.g., 1 GB chunks. In some embodiments, every extent is represented by a state that corresponds to the availability of the extent for use by a store. Initially, when the SSD is formatted and the extents are created, all of the extents are in a free state because none of the extents are used by or allocated to any of the object stores. Thus, the free state extents are available for use by an object store and may be selected by an extent manager (which will be described in more detail later) to be allocated to and used by an object store. Once the extent is selected by the store and the objects are stored in the extent, then the state of that extent is changed to active. Accordingly, an extent in an active state is available only to the object store to which it is allocated and cannot be allocated to any other object store. In some embodiments, when an active extent is filled with objects and the capacity of that extent is full, then the state of that extent is changed to closed. Yet in some embodiments, the state may be changed to closed when all of the objects are stored in the extent and no further objects are to be stored even though the extent has not reached its capacity yet. An extent that is in a closed state also cannot be allocated to any object stores and no further objects can be stored in it. Accordingly, each store is configured to select its own extent and only that store can use the selected extent.
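
By way of illustration only, the formatting step and the free, active, and closed states described above may be sketched as follows. The names ExtentState, Extent, and format_device are hypothetical, the use of binary gigabytes is an assumption, and the sketch changes the state to active at the time of allocation, which is only one of the possibilities described herein.

```python
from enum import Enum

GB = 1024 ** 3  # assumption: binary gigabytes


class ExtentState(Enum):
    FREE = "free"      # not allocated to any store
    ACTIVE = "active"  # allocated to exactly one store and still writable
    CLOSED = "closed"  # no further objects may be stored


class Extent:
    def __init__(self, index, size):
        self.index = index
        self.size = size
        self.used = 0
        self.owner = None
        self.state = ExtentState.FREE

    def allocate(self, store):
        # Only a free extent can be claimed by a store.
        if self.state is not ExtentState.FREE:
            raise ValueError(f"extent {self.index} is {self.state.value}")
        self.owner = store
        self.state = ExtentState.ACTIVE

    def write(self, nbytes):
        if self.state is not ExtentState.ACTIVE:
            raise ValueError(f"extent {self.index} is not active")
        if self.used + nbytes > self.size:
            raise ValueError("object does not fit in this extent")
        self.used += nbytes
        if self.used == self.size:
            self.state = ExtentState.CLOSED  # filled to capacity

    def close(self):
        # A store may close an extent before it reaches capacity.
        if self.state is not ExtentState.ACTIVE:
            raise ValueError(f"extent {self.index} is not active")
        self.state = ExtentState.CLOSED


def format_device(capacity, reserve=64 * GB, extent_size=1 * GB):
    """Set aside a reserve, then split the rest into equal sized free extents."""
    count = (capacity - reserve) // extent_size
    return [Extent(i, extent_size) for i in range(count)]
```

Under these assumptions, formatting a 1024 GB device with a 64 GB reserve and 1 GB extents yields 960 extents, all initially in the free state.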

According to an embodiment, any capacity size may be selected for each of the plurality of extents when the SSD is subdivided during formatting, as long as all of the extents are equal in size. Yet in some embodiments, the size of the extent is selected to be sufficiently large to store objects without reaching capacity too quickly such that multiple extents are used to store the remainder of the objects. On the other hand, in some embodiments, the size of the extent is selected to be sufficiently small such that enough extents are created on the SSD to store objects from different object stores. In other words, once an extent is claimed by or allocated to one object store, it can no longer be allocated to other object stores. Therefore, once an object store has claimed one extent, when a subsequent object store wants to store an object, a different extent is allocated to the subsequent object store. For example, if the size of the extent is so large that the SSD includes only two extents (e.g., a 20 GB SSD that has two 10 GB extents), then when two object stores claim each of the two extents, there will be no other available extents that can be claimed by the other object stores. On the other hand, if the size of each extent is made too small (e.g., a 20 GB SSD that has fifty 400 MB extents), then an extent may fill up quickly and therefore may affect performance (e.g., inefficiency due to having to use multiple extents to store objects). Thus, to achieve good efficiency, an extent should be sized such that it is neither too small nor too large. In one embodiment, each of the extents may be 1 GB in size, but a larger or smaller extent may be more suitable depending on the application and other considerations.
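
By way of illustration only, the sizing trade-off described above may be checked with simple arithmetic. The helper names extent_count and extents_needed are hypothetical, and decimal units are assumed so that the figures match the 20 GB example; the 5 GB payload is an assumed value chosen only for the example.

```python
GB = 1000 ** 3  # decimal units, assumed so that 20 GB / 400 MB = 50
MB = 1000 ** 2


def extent_count(ssd_capacity, extent_size):
    """Number of equal sized extents that fit on the device."""
    return ssd_capacity // extent_size


def extents_needed(payload_bytes, extent_size):
    """Number of extents one store needs to hold a payload (ceiling division)."""
    return -(-payload_bytes // extent_size)


# Large extents: few extents, so few stores can hold one at the same time.
large = extent_count(20 * GB, 10 * GB)

# Small extents: many extents, but a payload spans many of them.
small = extent_count(20 * GB, 400 * MB)
spanned = extents_needed(5 * GB, 400 * MB)
```

With 10 GB extents only two stores can hold an extent at once, whereas with 400 MB extents a 5 GB payload from a single store spans thirteen extents.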

In this manner, an SSD may be organized into a plurality of equal sized extents, and each extent may be characterized based on the characteristics or type of the object store that has claimed the particular extent. For example, FIG. 14 illustrates an SSD that is divided into a plurality of equal sized (e.g., 1 GB) extents, wherein some extents are in either active or closed states because they are already claimed by an object store and some extents are in a free state 208 because they have not yet been claimed and are not being used by any of the object stores. For example, extents 202, 204, and 206 are in an active or closed state and already include objects (e.g., metadata, mutable data, or immutable data).

According to an embodiment, a staging store may be used as a staging area for atomic objects before they are moved into the immutable store. Thus, the objects in the staging store are generally smaller in size than the objects in the immutable data store. For example, an atomic object may be 1 MB in size. The metadata store is used to store metadata of the object and this information is frequently updated because metadata is updated or changed every time an object is accessed (e.g., stored or retrieved). Moreover, metadata is usually small in size (e.g., about 64 bytes to a few kilobytes) but can grow larger. In some embodiments, about 5% of the SSD capacity is used by the metadata store and about another 5% is used by the staging store. The remaining SSD capacity of about 90% is used by the immutable data store, which is where the majority of the objects reside. Moreover, immutable data cannot be appended or overwritten in place. Thus, if it is desired to update the immutable data, then a new object is written (in another free extent) and the old object is deleted. On the other hand, metadata is mutable, and therefore may be appended or overwritten as desired. Thus, different extents having different characteristics may all be co-located on the same SSD, thereby not having to separate the different types of extents onto different SSDs. Moreover, the capacity available to each type of object on the SSD is not limited or pre-determined. That is, when more storage space is needed to store a particular type of object (e.g., metadata, immutable data, mutable data, staging data), then a further extent can be characterized to correspond to that type of object and allocated to the store.

Accordingly, various embodiments of the present disclosure describe techniques for operating an object storage system. FIG. 15 is a block diagram of information flow in the object storage system, according to an embodiment of the present disclosure. While this example embodiment includes three object stores, a mutable data store 302A, an immutable data store 302B, and a metadata store 302C, fewer or more stores may be included in other embodiments. The data stores 302A-302C may be connected to the SSD 310 that is divided into a plurality of equal sized extents for the objects to be stored. Additionally, while a single SSD 310 is illustrated in the example of FIG. 15 , the SSD 310 may actually represent a plurality of SSDs connected together in a shared storage configuration.

According to an embodiment, the data stores 302A-302C are configured to communicate with an extent manager 304 and the extent manager 304 may communicate with a superblock manager 306 and/or an SSD access manager 308. The SSD access manager 308 may then communicate with the SSD 310 to store objects in the SSD.

In more detail, the extent manager 304 manages the extents on the SSD 310 by directing the flow of objects from the data stores 302A-302C to the SSD 310. The extent manager 304 may track the ownership and capacity information of each extent. For example, the extent manager 304 maintains an in-memory data structure to track the free, active, and closed state of each extent. Thus, when the data store desires to save objects on the SSD, the data store communicates with the extent manager 304, then the extent manager 304 looks up the status of the extents on a superblock (which will be explained in more detail later), and finally selects an extent based on the availability of the extent and the characteristic of the object (e.g., mutable data, immutable data, or metadata).

In some embodiments, the extent manager 304 may persist the information at a well-known logical block address (LBA) in a region of the SSD referred to as the superblock. According to an embodiment, the superblock is similar to a master boot record for each SSD in the KVS node. FIG. 16 illustrates an example of a superblock. The superblock may be a portion of the SSD that is allocated to store status information of the SSD. For example, the superblock in FIG. 16 is allocated 128 KB of the SSD that is further divided into a 4 KB portion and a 124 KB portion. The 124 KB portion represents a map of all of the extents on the SSD, whereby each block of the superblock corresponds to an extent. For example, some blocks indicate that certain corresponding extents are allocated to a metadata store, a mutable data store, or an immutable data store, and therefore are in either an active or a closed state. Other blocks indicate that the corresponding extents are free and are not allocated to or used by any store.
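
By way of illustration only, one possible encoding of the superblock map may be sketched as follows. The disclosure does not specify the size or layout of a map entry; the 10-byte entry (state, characteristic, and byte offset of the extent), the numeric codes, and the names write_entry and read_entry are all hypothetical.

```python
import struct

KB = 1024
SUPERBLOCK_SIZE = 128 * KB
RESERVED_SIZE = 4 * KB                      # reserved portion
MAP_SIZE = SUPERBLOCK_SIZE - RESERVED_SIZE  # 124 KB extent map

# Hypothetical entry: 1-byte state, 1-byte characteristic, 8-byte extent offset.
ENTRY = struct.Struct("<BBQ")
MAX_EXTENTS = MAP_SIZE // ENTRY.size

FREE, ACTIVE, CLOSED = 0, 1, 2
UNTYPED, METADATA, MUTABLE, IMMUTABLE, STAGING = 0, 1, 2, 3, 4


def write_entry(superblock, index, state, kind, offset):
    """Record the status of one extent in the map portion."""
    if not 0 <= index < MAX_EXTENTS:
        raise IndexError("extent index outside the map")
    ENTRY.pack_into(superblock, RESERVED_SIZE + index * ENTRY.size, state, kind, offset)


def read_entry(superblock, index):
    """Return (state, characteristic, offset) for one extent."""
    if not 0 <= index < MAX_EXTENTS:
        raise IndexError("extent index outside the map")
    return ENTRY.unpack_from(superblock, RESERVED_SIZE + index * ENTRY.size)


superblock = bytearray(SUPERBLOCK_SIZE)  # a zeroed map means every extent is free
write_entry(superblock, 3, ACTIVE, METADATA, 3 * 1024 ** 3)
```

Because every entry sits at a fixed position computed from the extent index, the status of any extent can be read with a single lookup when the map is loaded at boot.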

According to another embodiment, the 4 KB portion of the superblock may be a reserved portion and may contain extent metadata. Accordingly, the superblock may be continuously updated with status information of the extent so that at any given moment, the superblock can provide information to the superblock manager 306 when requested. For example, when the computer system is initially booted, the superblock manager 306 may read the information on the superblock to understand the layout and the status of the extents on the SSD. By reading the superblock, the superblock manager 306 is able to determine the state of each extent, and if an extent is allocated to an object store, which object store the extent is allocated to.

FIG. 17 is a flow chart of a method for object storage in an object storage system. As described earlier, the SSD may be initially formatted, during which the SSD of the object storage system may be divided by an extent manager into equal sized extents (502). In some embodiments, an application (e.g., a software application) may desire to store data as objects in the SSD. Thus, the application may communicate with an extent manager to identify a first characteristic of a first data store of the object storage system comprising a first object (504). The first characteristic of the first data store may correspond to the characteristic of the first object, for example, metadata, immutable data, mutable data, or staging data. Once the first characteristic of the first data store is identified, the extent manager may allocate a first extent from among the plurality of extents, that corresponds with the characteristic of the first object, to the first data store (506). Accordingly, by allocating the first extent to the first data store, the state of the first extent may be changed to active. Once the corresponding first extent is allocated to the first data store, the extent manager may coordinate the storing of the first object to the allocated first extent (508). In this manner, an object having the first characteristic from the first data store may be stored in the first extent on the SSD.

According to an embodiment, a similar process may be repeated for storing other data as an object from the application where the object has a different characteristic. Thus, a second data store of the object storage system may include a second object that has a second characteristic, and the second characteristic of the second data store may be identified by the extent manager (510). Next, the extent manager may allocate a second extent corresponding to the second characteristic of the second object from the plurality of extents, to the second data store, and the second characteristic may be different from the first characteristic (512). For example, if the first characteristic is immutable data, then the second characteristic may be metadata. Once the second extent is allocated to the second data store, the second object may now be stored in the second extent (514). Accordingly, a plurality of objects having different characteristics may be co-located and stored on the same SSD by storing them in corresponding extents on the SSD.
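
By way of illustration only, the operations 504 through 514 may be sketched as follows, with two stores having different characteristics receiving different extents of the same device. The class ExtentManager, its methods, and the store names are hypothetical, and capacity tracking is omitted for brevity.

```python
class ExtentManager:
    def __init__(self, num_extents):
        # After formatting, every extent is free and has no characteristic.
        self.extents = [
            {"state": "free", "kind": None, "owner": None, "objects": []}
            for _ in range(num_extents)
        ]

    def allocate(self, store_name, kind):
        """Claim the first free extent and give it the store's characteristic."""
        for index, extent in enumerate(self.extents):
            if extent["state"] == "free":
                extent.update(state="active", kind=kind, owner=store_name)
                return index
        raise RuntimeError("no free extent available")

    def store(self, store_name, kind, obj):
        """Store an object in the store's active extent, allocating one if needed."""
        for index, extent in enumerate(self.extents):
            if extent["state"] == "active" and extent["owner"] == store_name:
                break
        else:
            index = self.allocate(store_name, kind)
        self.extents[index]["objects"].append(obj)
        return index


manager = ExtentManager(num_extents=4)
first = manager.store("immutable-store", "immutable", b"object payload")
second = manager.store("metadata-store", "metadata", b"size=14")
```

In this sketch the two objects land in different extents of the same device, and each extent carries the characteristic of the store that claimed it.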

In this manner, the application may communicate with the data stores of the object storage system, and the extent manager may facilitate the storing of the objects into various extents on the SSD. More particularly, based on determining the data store characteristics, the extent manager communicates with the superblock manager, and the superblock manager looks up the contents of the superblock to find an available extent (e.g., an extent that is in a free state) that corresponds to the characteristics of the data store. For example, if the data store is a metadata store, then a metadata extent is selected, if the data store is an immutable data store, then an immutable data extent is selected, and if the data store is a mutable data store, then a mutable data extent is selected.

According to an embodiment, the superblock manager determines the specific location of the selected extent on the SSD and provides this information to the extent manager. The extent manager then updates the state of the selected extent to active and the objects from the data store are provided to the SSD access manager, which writes the objects to the SSD using information (e.g., SSD location) derived from the extent manager. As the objects are stored in the selected extent, if the extent becomes full, then the extent manager will change the state of that extent to closed and will select a different extent that is free, and the remainder of the objects will continue to be stored until all objects are stored. According to an embodiment, even though the data store has finished storing all objects to the selected extent, and even though free space may still be available in this extent, this extent may not be allocated to another data store because it is already allocated and the state remains either active or closed.
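
By way of illustration only, the behavior of closing a full extent and continuing in another free extent may be sketched as follows. The function store_objects and the dictionary fields are hypothetical, and the sketch assumes that an object is never split across extents, so an extent is closed as soon as the next object would not fit.

```python
def store_objects(extents, owner, object_sizes, extent_size):
    """Place objects of the given sizes into extents, returning the extent
    index chosen for each object."""
    placement = []
    current = None
    for size in object_sizes:
        if size > extent_size:
            raise ValueError("object larger than an extent")
        # Close the current extent when the next object would not fit.
        if current is not None and extents[current]["used"] + size > extent_size:
            extents[current]["state"] = "closed"
            current = None
        if current is None:
            # Select a different extent that is still free.
            current = next(
                (i for i, e in enumerate(extents) if e["state"] == "free"), None
            )
            if current is None:
                raise RuntimeError("no free extent available")
            extents[current].update(state="active", owner=owner)
        extents[current]["used"] += size
        placement.append(current)
        if extents[current]["used"] == extent_size:
            extents[current]["state"] = "closed"  # filled to capacity
            current = None
    return placement


extents = [{"state": "free", "owner": None, "used": 0} for _ in range(3)]
placement = store_objects(extents, "immutable-store", [4, 4, 4], extent_size=10)
```

Here the third object does not fit in the first extent, so that extent is closed with free space remaining and the object is stored in the next free extent, which stays active.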

According to an embodiment, as more objects are stored in the SSD, more metadata is also stored in the same co-located SSD, thereby not having to rely on a separate metadata server to store the metadata. Thus, as more storage is needed for the different types of objects having different characteristics (e.g., metadata, immutable data, mutable data, staging data), more extents corresponding to those characteristics may be selected and allocated. Therefore, the storage capacity for the objects may be elastically and independently grown as determined by demand. Furthermore, additional SSDs may be added as desired or as more storage capacity is needed. In this manner, the capacity of the data store is elastic.

As described above, an extent that is already allocated to a data store is no longer available to other data stores even if free space is still available in that extent. Thus, according to an embodiment, the free space can be reclaimed by defragmenting the extents. During defragmentation, any extent that is not completely filled is considered and the objects are moved to a different free extent. Additionally, objects from other extents are moved into this same extent until this extent is filled to capacity. Once the objects have been moved over to this extent, the objects from the other partially filled extents may be deleted and the state of those extents may be changed back to a free state again. By performing this defragmentation process, the objects from the partially filled extents may be consolidated to better utilize the extents and free up unused space for use by other data stores.
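
By way of illustration only, the defragmentation process described above may be sketched as follows. The function defragment and the dictionary fields are hypothetical. The sketch assumes that only extents owned by the same store are consolidated together, that objects are represented by their sizes, and that the move is planned before any extent is modified.

```python
def defragment(extents, owner, extent_size):
    """Consolidate the partially filled extents of one store into free
    extents and return the net number of extents reclaimed."""
    sources = [
        e for e in extents
        if e["owner"] == owner and sum(e["objects"]) < extent_size
    ]
    # Plan: pack the objects of all partial extents into as few groups as
    # possible, starting a new group when the next object would not fit.
    groups = [[]]
    for extent in sources:
        for size in extent["objects"]:
            if sum(groups[-1]) + size > extent_size:
                groups.append([])
            groups[-1].append(size)
    free = [e for e in extents if e["state"] == "free"]
    if len(groups) >= len(sources) or len(groups) > len(free):
        return 0  # no gain, or not enough free extents to move into
    # Move the objects into the free target extents.
    for target, objects in zip(free, groups):
        full = sum(objects) == extent_size
        target.update(owner=owner, objects=objects,
                      state="closed" if full else "active")
    # Delete the moved objects and return the source extents to the free state.
    for extent in sources:
        extent.update(state="free", owner=None, objects=[])
    return len(sources) - len(groups)


extents = [
    {"state": "closed", "owner": "immutable", "objects": [4]},
    {"state": "closed", "owner": "immutable", "objects": [3, 3]},
    {"state": "active", "owner": "metadata", "objects": [2]},
    {"state": "free", "owner": None, "objects": []},
]
reclaimed = defragment(extents, "immutable", extent_size=10)
```

In this example the two partially filled extents are merged into one full extent, so the device ends with two free extents instead of one.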

One or more embodiments according to the present disclosure may include one or more characteristics of one or more of the following clauses (although embodiments are not limited thereto):

Clause 1. A method of object storage on an object storage system, the method comprising:

dividing, by an extent manager, a memory device of the object storage system into a plurality of equal sized extents;

identifying, by the extent manager, a first characteristic of a first data store of the object storage system comprising a first object;

allocating, by the extent manager, a first extent corresponding to the first characteristic of the first object from the plurality of extents, to the first data store;

storing, by the extent manager, the first object to the allocated first extent;

identifying, by the extent manager, a second characteristic of a second data store of the object storage system comprising a second object;

allocating, by the extent manager, a second extent corresponding to the second characteristic of the second object from the plurality of extents, to the second data store, wherein the second characteristic is different from the first characteristic; and

storing, by the extent manager, the second object to the allocated second extent.

Clause 2. The method of clause 1, further comprising retrieving, by the extent manager, status information of each of the plurality of extents from a superblock on the memory device.

Clause 3. The method of clause 2, wherein the status information comprises a state of each extent, a characteristic of each extent, and a location of each extent on the memory device.

Clause 4. The method of clause 3, wherein the state of each extent is selected from a free state that is devoid of objects and is available for allocation, an active state that is currently allocated, or a closed state that is unavailable for allocation.

Clause 5. The method of clause 4, further comprising:

updating, by the extent manager, the status information of the first extent in response to allocating the first extent to the first data store; and

updating, by the extent manager, the status information of the second extent in response to allocating the second extent to the second data store.

Clause 6. The method of clause 5, further comprising updating, by the extent manager, the state information corresponding to the first extent to a closed state in response to the first extent being filled to capacity with objects.

Clause 7. The method of clause 5, wherein the superblock comprises a table in a reserved portion on the memory device, wherein the table corresponds to each of the plurality of extents.

Clause 8. The method of clause 3, wherein the characteristic of each extent is selected from an immutable data extent, a mutable data extent, a metadata extent, or a staging extent.

Clause 9. The method of clause 1, wherein the memory device is a solid state drive (SSD).

Clause 10. The method of clause 9, wherein the object storage system comprises a plurality of SSDs connected in parallel.

Clause 11. An object storage system configured to store objects, the object storage system comprising a plurality of data stores, a memory device, and an extent manager, wherein the extent manager is configured to:

divide the memory device of the object storage system into a plurality of equal sized extents;

identify a first characteristic of a first data store of the plurality of data stores comprising a first object;

allocate a first extent of the plurality of extents corresponding to the first characteristic of the first object, to the first data store;

store the first object to the allocated first extent;

identify a second characteristic of a second data store of the plurality of data stores comprising a second object;

allocate a second extent of the plurality of extents corresponding to the second characteristic of the second object, to the second data store, wherein the second characteristic is different from the first characteristic; and

store the second object to the allocated second extent.

Clause 12. The system of clause 11, wherein the extent manager is further configured to retrieve status information of each of the plurality of extents from a superblock on the memory device.

Clause 13. The system of clause 12, wherein the status information comprises a state of each extent, a characteristic of each extent, and a location of each extent on the memory device.

Clause 14. The system of clause 13, wherein the state of each extent is selected from a free state that is devoid of objects and is available for allocation, an active state that is currently allocated, or a closed state that is unavailable for allocation.

Clause 15. The system of clause 14, wherein the extent manager is further configured to:

update the status information of the first extent in response to allocating the first extent to the first data store; and

update the status information of the second extent in response to allocating the second extent to the second data store.

Clause 16. The system of clause 15, wherein the extent manager is further configured to update the state information corresponding to the first extent to a closed state in response to the first extent being filled to capacity with objects.

Clause 17. The system of clause 15, wherein the superblock comprises a table in a reserved portion on the memory device, wherein the table corresponds to each of the plurality of extents.

Clause 18. The system of clause 13, wherein the characteristic of each extent is selected from an immutable data extent, a mutable data extent, a metadata extent, or a staging extent.

Clause 19. The system of clause 11, wherein the memory device is a solid state drive (SSD).

Clause 20. The system of clause 19, wherein the object storage system comprises a plurality of SSDs connected in parallel.

It will be understood that, although the terms “first,” “second,” “third,” etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section described below could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the present invention.

It will be understood that when an element or layer is referred to as being “on,” “connected to,” or “coupled to” another element or layer, it can be directly on, connected to, or coupled to the other element or layer, or one or more intervening elements or layers may be present. In addition, it will also be understood that when an element or layer is referred to as being “between” two elements or layers, it can be the only element or layer between the two elements or layers, or one or more intervening elements or layers may also be present.

The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting of the present invention. As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and “including,” when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.

As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. Further, the use of “may” when describing embodiments of the present invention refers to “one or more embodiments of the present invention.” As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.

The electronic or electric devices and/or any other relevant devices or components according to embodiments of the present invention described herein may be implemented utilizing any suitable hardware, firmware (e.g., an application-specific integrated circuit), software, or a combination of software, firmware, and/or hardware. For example, the various components of these devices may be formed on one integrated circuit (IC) chip or on separate IC chips. Further, the various components of these devices may be implemented on a flexible printed circuit film, a tape carrier package (TCP), or a printed circuit board (PCB), or may be formed on one substrate. Further, the various components of these devices may be a process or thread, running on one or more processors, in one or more computing devices, executing computer program instructions and interacting with other system components for performing the various functionalities described herein. The computer program instructions are stored in a memory which may be implemented in a computing device using a standard memory device, such as, for example, a random access memory (RAM). The computer program instructions may also be stored in other non-transitory computer-readable media such as, for example, a CD-ROM, a flash drive, or the like. Also, a person of skill in the art should recognize that the functionality of various computing devices may be combined or integrated into a single computing device, or the functionality of a particular computing device may be distributed across one or more other computing devices without departing from the spirit and scope of the example embodiments of the present invention.
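
By way of non-limiting illustration only, a thread-based software arrangement of the kind mentioned above may be sketched as follows. The sketch assumes a pool of worker threads that process only input-output operations (here, writing the same data to two object stores for resiliency) and a single thread that processes only input-output completions. All names (`ObjectStore`, `submit_worker`, `completion_worker`, `run`), the queue-based hand-off, the in-memory stores, and the worker count are hypothetical and are not taken from the embodiments described herein. Python threads are not bound to particular cores, so the sketch models only the division of work between operation processing and completion processing, not the assignment of that work to processing circuits.

```python
import queue
import threading

NUM_SUBMIT_WORKERS = 4  # assumed value, chosen only for illustration


class ObjectStore:
    """Hypothetical in-memory stand-in for an object store."""

    def __init__(self, name):
        self.name = name
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def submit_worker(requests, completions, stores):
    """Processes only input-output operations; never handles completions."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more requests for this worker
            break
        req_id, key, value = item
        # Write the same data to two distinct stores (simple replication).
        primary = stores[req_id % len(stores)]
        replica = stores[(req_id + 1) % len(stores)]
        primary.put(key, value)
        replica.put(key, value)
        # Hand the completion off instead of processing it here.
        completions.put(req_id)


def completion_worker(completions, completed):
    """Single thread that processes only input-output completions."""
    while True:
        req_id = completions.get()
        if req_id is None:  # sentinel: all operations have been submitted
            break
        completed.append(req_id)


def run(num_requests, num_stores=3, num_submit_workers=NUM_SUBMIT_WORKERS):
    stores = [ObjectStore(f"store-{i}") for i in range(num_stores)]
    requests = queue.Queue()
    completions = queue.Queue()
    completed = []

    completer = threading.Thread(
        target=completion_worker, args=(completions, completed))
    workers = [
        threading.Thread(
            target=submit_worker, args=(requests, completions, stores))
        for _ in range(num_submit_workers)
    ]
    completer.start()
    for worker in workers:
        worker.start()

    for req_id in range(num_requests):
        requests.put((req_id, f"obj-{req_id}", f"payload-{req_id}".encode()))
    for _ in workers:
        requests.put(None)  # one sentinel per operation worker
    for worker in workers:
        worker.join()

    completions.put(None)  # stop the completion thread after all operations
    completer.join()
    return stores, completed
```

In this sketch, calling `run(20)` returns the stores and a list of completed request identifiers. Because several operation workers run concurrently, the order of that list is not fixed.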

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification, and should not be interpreted in an idealized or overly formal sense, unless expressly so defined herein.

Embodiments described herein are examples only. One skilled in the art may recognize various alternative embodiments from those specifically disclosed. Those alternative embodiments are also intended to be within the scope of this disclosure. As such, the embodiments are limited only by the following claims and their equivalents. 

What is claimed is:
 1. A storage system, comprising: a plurality of object stores; and a plurality of data managers, connected to the object stores, the plurality of data managers comprising a plurality of processing circuits, a first processing circuit of the plurality of processing circuits being configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits being configured to process primarily input-output completions.
 2. The storage system of claim 1, wherein the second processing circuit is configured to execute a single software thread.
 3. The storage system of claim 1, wherein the processing of the input-output operations comprises writing data for data resiliency.
 4. The storage system of claim 3, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing the first data to a second object store of the plurality of object stores.
 5. The storage system of claim 3, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing parity data corresponding to the first data to a second object store of the plurality of object stores.
 6. The storage system of claim 1, wherein the first processing circuit is configured to process at least 10 times as many input-output operations as input-output completions.
 7. The storage system of claim 6, wherein the second processing circuit is configured to process at least 10 times as many input-output completions as input-output operations.
 8. The storage system of claim 1, wherein the first processing circuit is configured to process only input-output operations.
 9. The storage system of claim 1, wherein the second processing circuit is configured to process only input-output completions.
 10. The storage system of claim 9, wherein the first processing circuit is configured to process only input-output operations.
 11. The storage system of claim 1, wherein the first processing circuit is a first core of a first data manager, and the second processing circuit is a second core of the first data manager.
 12. The storage system of claim 1, wherein: a first data manager of the plurality of data managers comprises a plurality of cores; a first subset of the plurality of cores is configured to process primarily input-output operations and a second subset of the plurality of cores is configured to process primarily input-output completions; and the first subset includes at least 10 times as many cores as the second subset.
 13. The storage system of claim 12, wherein the first subset includes at most 100 times as many cores as the second subset.
 14. A method for operating a storage system comprising a plurality of object stores and a plurality of data managers, the data managers being connected to the object stores, the method comprising: receiving a plurality of contiguous requests to perform input-output operations; processing, by a plurality of processing circuits of the data managers, the input-output operations; and processing, by the plurality of processing circuits of the data managers, a plurality of input-output completions corresponding to the input-output operations, wherein: a first processing circuit of the plurality of processing circuits processes primarily input-output operations, and a second processing circuit of the plurality of processing circuits processes primarily input-output completions.
 15. The method of claim 14, wherein the second processing circuit is configured to execute a single software thread.
 16. The method of claim 14, wherein the processing of the input-output operations comprises writing data for data resiliency.
 17. The method of claim 16, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing the first data to a second object store of the plurality of object stores.
 18. The method of claim 16, wherein the writing of data for data resiliency comprises: writing first data to a first object store of the plurality of object stores; and writing parity data corresponding to the first data to a second object store of the plurality of object stores.
 19. A storage system, comprising: means for storing objects; and a plurality of data managers, connected to the means for storing objects, the plurality of data managers comprising a plurality of processing circuits, a first processing circuit of the plurality of processing circuits being configured to process primarily input-output operations, and a second processing circuit of the plurality of processing circuits being configured to process primarily input-output completions.
 20. The storage system of claim 19, wherein the second processing circuit is configured to execute a single software thread. 